Preface to the First Edition.


This text grew out of my lecturing the Graduate Probability sequence 
STA 2111F / 2211S at the University of Toronto over a period of several 
years. During this time, it became clear to me that there are a large number 
of graduate students from a variety of departments (mathematics, statistics, 
economics, management, finance, computer science, engineering, etc.) who 
require a working knowledge of rigorous probability, but whose mathemat- 
ical background may be insufficient to dive straight into advanced texts on 
the subject. 

This text is intended to answer that need. It provides an introduction 
to rigorous (i.e., mathematically precise) probability theory using measure 
theory. At the same time, I have tried to make it brief and to the point, and 
as accessible as possible. In particular, probabilistic language and perspec- 
tive are used throughout, with necessary measure theory introduced only 
as needed. 

I have tried to strike an appropriate balance between rigorously covering 
the subject, and avoiding unnecessary detail. The text provides mathemat- 
ically complete proofs of all of the essential introductory results of proba- 
bility theory and measure theory. However, more advanced and specialised 
areas are ignored entirely or only briefly hinted at. For example, the text 
includes a complete proof of the classical Central Limit Theorem, including 
the necessary Continuity Theorem for characteristic functions. However, 
the Lindeberg Central Limit Theorem and Martingale Central Limit The- 
orem are only briefly sketched and are not proved. Similarly, all necessary 
facts from measure theory are proved before they are used. However, more 
abstract and advanced measure theory results are not included. Further- 
more, the measure theory is almost always discussed purely in terms of 
probability, as opposed to being treated as a separate subject which must 
be mastered before probability theory can be studied. 

I hesitated to bring these notes to publication. There are many other 
books available which treat probability theory with measure theory, and 
some of them are excellent. For a partial list see Subsection B.3 on page 
210. (Indeed, the book by Billingsley was the textbook from which I taught 
before I started writing these notes. While much has changed since then, the 
knowledgeable reader will still notice Billingsley’s influence in the treatment 
of many topics herein. The Billingsley book remains one of the best sources 
for a complete, advanced, and technically precise treatment of probability 
theory with measure theory.) In terms of content, therefore, the current 
text adds very little indeed to what has already been written. It was only 
the reaction of certain students, who found the subject easier to learn from 
my notes than from longer, more advanced, and more all-inclusive books, 
that convinced me to go ahead and publish. The reader is urged to consult 
other books for further study and additional detail.

There are also many books available (see Subsection B.2) which treat 
probability theory at the undergraduate, less rigorous level, without the use 
of general measure theory. Such texts provide intuitive notions of probabil- 
ities, random variables, etc., but without mathematical precision. In this 
text it will generally be assumed, for purposes of intuition, that the stu- 
dent has at least a passing familiarity with probability theory at this level. 
Indeed, Section 1 of the text attempts to link such intuition with the math- 
ematical precision to come. However, mathematically speaking we will not 
require many results from undergraduate-level probability theory. 


Structure. The first six sections of this book could be considered to 
form a “core” of essential material. After learning them, the student will 
have a precise mathematical understanding of probabilities and $\sigma$-algebras;
random variables, distributions, and expected values; and inequalities and 
laws of large numbers. Sections 7 and 8 then diverge into the theory of 
gambling games and Markov chain theory. Section 9 provides a bridge 
to the more advanced topics of Sections 10 through 14, including weak 
convergence, characteristic functions, the Central Limit Theorem, Lebesgue 
Decomposition, conditioning, and martingales. 

The final section, Section 15, provides a wide-ranging and somewhat 
less rigorous introduction to the subject of general stochastic processes. It 
leads up to diffusions, Itô’s Lemma, and finally a brief look at the famous
Black-Scholes equation from mathematical finance. It is hoped that this 
final section will inspire readers to learn more about various aspects of 
stochastic processes. 

Appendix A contains basic facts from elementary mathematics. This 
appendix can be used for review and to gauge the book’s level. In addition, 
the text makes frequent reference to Appendix A, especially in the earlier 
sections, to ease the transition to the required mathematical level for the 
subject. It is hoped that readers can use familiar topics from Appendix A 
as a springboard to less familiar topics in the text. 

Finally, Appendix B lists a variety of references, for background and for 
further reading. 


Exercises. The text contains a number of exercises. Those very closely 
related to textual material are inserted at the appropriate place. Additional 
exercises are found at the end of each section, in a separate subsection. 
I have tried to make the exercises thought provoking without being too 
difficult. Hints are provided where appropriate. Rather than always asking 
for computations or proofs, the exercises sometimes ask for explanations 
and/or examples, to hopefully clarify the subject matter in the student’s 
mind. 


Prerequisites. As a prerequisite to reading this text, the student should 
have a solid background in basic undergraduate-level real analysis (not
including measure theory). In particular, the mathematical background
summarised in Appendix A should be very familiar. If it is not, then books
such as those in Subsection B.1 should be studied first. It is also helpful, 
but not essential, to have seen some undergraduate-level probability theory 
at the level of the books in Subsection B.2. 


Further reading. For further reading beyond this text, the reader should 
examine the similar but more advanced books of Subsection B.3. To learn 
additional topics, the reader should consult the books on pure measure 
theory of Subsection B.4, and/or the advanced books on stochastic processes 
of Subsection B.5, and/or the books on mathematical finance of Subsection 
B.6. I would be content to learn only that this text has inspired students 
to look at more advanced treatments of the subject. 
[...] encouraging me in this direction, in particular Mike Evans, Andrey Feuerverger,
Keith Knight, Omiros Papaspiliopoulos, Jeremy Quastel, Nancy Reid, and 
Gareth Roberts. Most importantly, I would like to thank the many stu- 
dents who have studied these topics with me; their questions, insights, and 
difficulties have been my main source of inspiration. 


Jeffrey S. Rosenthal 

Toronto, Canada, 2000 
jeff@math.toronto.edu 
Preface to the Second Edition.


I am pleased to have the opportunity to publish a second edition of this 
book. The book’s basic structure and content are unchanged; in particular, 
the emphasis on establishing probability theory’s rigorous mathematical 
foundations, while minimising technicalities as much as possible, remains 
paramount. However, having taught from this book for several years, I 
have made considerable revisions and improvements. For example: 

• Many small additional topics have been added, and existing topics
expanded. As a result, the second edition is over forty pages longer than
the first.

• Many new exercises have been added, and some of the existing exercises
have been improved or “cleaned up”. There are now about 275 exercises
in total (as compared with 150 in the first edition), ranging in difficulty
from quite easy to fairly challenging, many with hints provided.

• Further details and explanations have been added in steps of proofs
which previously caused confusion.

• Several of the longer proofs are now broken up into a number of lemmas,
to more easily keep track of the different steps involved, and to allow
for the possibility of skipping the most technical bits while retaining the
proof’s overall structure.

• A few proofs, which are required for mathematical completeness but
which require advanced mathematics background and/or add little
understanding, are now marked as “optional”.

• Various interesting, but technical and inessential, results are presented
as remarks or footnotes, to add information and context without
interrupting the text’s flow.

• The Extension Theorem now allows the original set function to be
defined on a semialgebra rather than an algebra, thus simplifying its
application and increasing understanding.

• Many minor edits and rewrites were made throughout the book to
improve the clarity, accuracy, and readability.


I thank Ying Oi Chiew and Lai Fun Kwong of World Scientific for facilitating
this edition, and thank Richard Dudley, Eung Jun Lee, Neal Madras, Peter 
Rosenthal, Hermann Thorisson, and Bálint Virág for helpful comments. 
Also, I again thank the many students who have studied and discussed 
these topics with me over many years. 


Jeffrey S. Rosenthal 
Toronto, Canada, 2006 
1. The need for measure theory.


This introductory section is directed primarily to those readers who have 
some familiarity with undergraduate-level probability theory, and who may 
be unclear as to why it is necessary to introduce measure theory and other 
mathematical difficulties in order to study probability theory in a rigorous 
manner. 

We attempt to illustrate the limitations of undergraduate-level proba- 
bility theory in two ways: the restrictions on the kinds of random variables 
it allows, and the question of what sets can have probabilities defined on 
them. 


1.1. Various kinds of random variables. 


The reader familiar with undergraduate-level probability will be com- 
fortable with a statement like, “Let X be a random variable which has the 
Poisson(5) distribution.” The reader will know that this means that X 
takes as its value a “random” non-negative integer, such that the integer 
$k \ge 0$ is chosen with probability $P(X = k) = e^{-5} 5^k / k!$. The expected value
of, say, $X^2$, can then be computed as $E(X^2) = \sum_{k=0}^{\infty} k^2 e^{-5} 5^k / k!$. X is an
example of a discrete random variable. 
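
As a quick numerical check of the series for the expected value of the square of a Poisson(5) random variable, the following sketch truncates the sum after 100 terms. The function names and the truncation point are our own illustrative choices; the exact value is 5 + 5^2 = 30 (variance plus squared mean).

```python
import math

def poisson_pmf(k, lam=5.0):
    """P(X = k) = e^{-lam} lam^k / k!, computed via logarithms to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def truncated_moment(power, lam=5.0, terms=100):
    """Approximate E(X^power) by summing the first `terms` terms of the series.

    The truncation point is an assumption; the neglected tail is negligible for lam = 5.
    """
    return sum(k**power * poisson_pmf(k, lam) for k in range(terms))

total_mass = truncated_moment(0)      # the probabilities should sum to (nearly) 1
second_moment = truncated_moment(2)   # should be close to 5 + 5**2 = 30
print(total_mass, second_moment)
```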

Similarly, the reader will be familiar with a statement like, “Let Y be 
a random variable which has the Normal(0,1) distribution.” This means 
that the probability that Y lies between two real numbers a < b is given 
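by the integral $P(a \le Y \le b) = \int_a^b \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. (On the other hand,
P(Y = y) = 0 for any particular real number y.) The expected value of,
say, $Y^2$, can then be computed as $E(Y^2) = \int_{-\infty}^{\infty} y^2 \frac{1}{\sqrt{2\pi}} e^{-y^2/2} \, dy$. Y is an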
example of an absolutely continuous random variable. 

But now suppose we introduce a new random variable Z, as follows. 
We let X and Y be as above, and then flip an (independent) fair coin. 
If the coin comes up heads we set Z = X, while if it comes up tails we 
set Z = Y. In symbols, P(Z = X) = P(Z = Y) = 1/2. Then what
sort of random variable is Z? It is not discrete, since it can take on an 
uncountable number of different values. But it is not absolutely continuous, 
since for certain values z (specifically, when z is a non-negative integer) we 
have P(Z = z) > 0. So how can we study the random variable Z? How 
could we compute, say, the expected value of $Z^2$?

The correct response to this question, of course, is that the division 
of random variables into discrete versus absolutely continuous is artificial. 
Instead, measure theory allows us to give a common definition of expected 
value, which applies equally well to discrete random variables (like X above), 
to continuous random variables (like Y above), to combinations of them (like
Z above), and to other kinds of random variables not yet imagined. These
issues are considered in Sections 4, 6, and 12. 
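
Even before the general theory is available, the mixed random variable Z can be explored by simulation. The sketch below is only illustrative: it assumes that conditioning on the coin gives E(Z^2) = (1/2) E(X^2) + (1/2) E(Y^2) = (1/2)(30) + (1/2)(1) = 15.5, and compares this with a Monte Carlo average. The sampler, the seed, and the sample size are our own hypothetical choices.

```python
import math
import random

def sample_poisson(rng, lam=5.0):
    """Inverse-transform sampling: walk up the Poisson cdf until it exceeds u."""
    u = rng.random()
    k, p = 0, math.exp(-lam)
    cdf = p
    while u > cdf and k < 200:   # the cap guards against floating-point stalls
        k += 1
        p *= lam / k
        cdf += p
    return k

def sample_z(rng):
    """Flip a fair coin: heads gives Z = X ~ Poisson(5), tails gives Z = Y ~ Normal(0,1)."""
    if rng.random() < 0.5:
        return sample_poisson(rng)
    return rng.gauss(0.0, 1.0)

def estimate_second_moment(n=100_000, seed=0):
    """Monte Carlo estimate of E(Z^2) from n independent samples."""
    rng = random.Random(seed)
    return sum(sample_z(rng) ** 2 for _ in range(n)) / n

exact = 0.5 * 30 + 0.5 * 1          # value obtained by conditioning on the coin
estimate = estimate_second_moment()
print(exact, estimate)
```

The estimate should land near 15.5, with a sampling error that shrinks as n grows.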


1.2. The uniform distribution and non-measurable sets. 


In undergraduate-level probability, continuous random variables are of- 
ten studied in detail. However, a closer examination suggests that perhaps 
such random variables are not completely understood after all. 

To take the simplest case, suppose that X is a random variable which 
has the uniform distribution on the unit interval [0,1]. In symbols,
$X \sim \text{Uniform}[0, 1]$. What precisely does this mean?

Well, certainly this means that $P(0 \le X \le 1) = 1$. It also means that
$P(0 \le X \le 1/2) = 1/2$, that $P(3/4 \le X \le 7/8) = 1/8$, etc., and in general
that $P(a \le X \le b) = b - a$ whenever $0 \le a \le b \le 1$, with the same formula
holding if $\le$ is replaced by $<$. We can write this as

$$P([a,b]) = P((a,b]) = P([a,b)) = P((a,b)) = b - a, \qquad 0 \le a \le b \le 1. \tag{1.2.1}$$
In words, the probability that X lies in any interval contained in [0,1] is 
simply the length of the interval. (We include in this the degenerate case 
when a = b, so that P({a}) = 0 for the singleton set {a}; in words, the 
probability that X is equal to any particular number a is zero.) 
Similarly, this means that 


$$P(1/4 \le X \le 1/2 \text{ or } 2/3 \le X \le 5/6) = P(1/4 \le X \le 1/2) + P(2/3 \le X \le 5/6) = 1/4 + 1/6 = 5/12,$$

and in general that if A and B are disjoint subsets of [0,1] (for example, if
A = [1/4, 1/2] and B = [2/3, 5/6]), then

$$P(A \cup B) = P(A) + P(B). \tag{1.2.2}$$


Equation (1.2.2) is called finite additivity. 
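
The arithmetic above can be reproduced exactly with rational numbers. The following is a minimal sketch, assuming that each interval is represented by a hypothetical (left, right) pair of endpoints; whether the endpoints are included is ignored, which is harmless by (1.2.1).

```python
from fractions import Fraction

def uniform_prob(intervals):
    """P(X in a union of pairwise disjoint intervals) for X ~ Uniform[0,1].

    Each interval is a (left, right) pair with 0 <= left <= right <= 1.
    Uses (1.2.1) for each piece and finite additivity (1.2.2) to add them up.
    """
    ordered = sorted(intervals)
    for (_, right), (next_left, _) in zip(ordered, ordered[1:]):
        if next_left < right:
            raise ValueError("intervals must be pairwise disjoint")
    return sum((right - left for left, right in ordered), Fraction(0))

A = (Fraction(1, 4), Fraction(1, 2))
B = (Fraction(2, 3), Fraction(5, 6))
print(uniform_prob([A, B]))  # 5/12
```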

Indeed, to allow for countable operations (such as limits, which are ex- 
tremely important in probability theory), we would like to extend (1.2.2) to 
the case of a countably infinite number of disjoint subsets: if $A_1, A_2, A_3, \ldots$
are disjoint subsets of [0,1], then

$$P(A_1 \cup A_2 \cup A_3 \cup \cdots) = P(A_1) + P(A_2) + P(A_3) + \cdots. \tag{1.2.3}$$


Equation (1.2.3) is called countable additivity. 
Note that we do not extend equation (1.2.3) to uncountable additivity. 
Indeed, if we did, then we would expect that $P([0,1]) = \sum_{x \in [0,1]} P(\{x\})$,
which is clearly false since the left-hand side equals 1 while the right-hand
side equals 0. (There is no contradiction to (1.2.3) since the interval [0, 1] is
not countable.) It is for this reason that we restrict attention to countable
operations. (For a review of countable and uncountable sets, see Subsection A.2.
Also, recall that for non-negative uncountable collections $\{r_\alpha\}_{\alpha \in I}$,
$\sum_{\alpha \in I} r_\alpha$ is defined to be the supremum of $\sum_{\alpha \in J} r_\alpha$ over finite $J \subseteq I$.)

Similarly, to reflect the fact that X is “uniform” on the interval [0,1], 
the probability that X lies in some subset should be unaffected by “shifting” 
(with wrap-around) the subset by a fixed amount. That is, if for each subset 
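$A \subseteq [0,1]$ we define the r-shift of A by

$$A \oplus r = \{a + r \,;\, a \in A,\ a + r \le 1\} \cup \{a + r - 1 \,;\, a \in A,\ a + r > 1\}, \tag{1.2.4}$$

then we should have

$$P(A \oplus r) = P(A), \qquad 0 \le r \le 1. \tag{1.2.5}$$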
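
For an interval A = [a, b], the r-shift (1.2.4) can be written out explicitly, and shift-invariance (1.2.5) then amounts to the pieces having total length b - a. The sketch below assumes a hypothetical representation of each piece as a (left, right) pair and ignores whether endpoints are included, since single points have probability zero by (1.2.1).

```python
from fractions import Fraction

def shift_interval(a, b, r):
    """Return the r-shift (1.2.4) of [a, b] within [0, 1] as a list of (left, right) pieces."""
    if not (0 <= a <= b <= 1 and 0 <= r <= 1):
        raise ValueError("need 0 <= a <= b <= 1 and 0 <= r <= 1")
    if b + r <= 1:                      # nothing wraps around
        return [(a + r, b + r)]
    if a + r > 1:                       # everything wraps around
        return [(a + r - 1, b + r - 1)]
    # The interval straddles 1: points with a + r <= 1 stay, the rest wrap to the left end.
    return [(a + r, Fraction(1)), (Fraction(0), b + r - 1)]

def total_length(pieces):
    """Sum of the lengths of the pieces, i.e. their probability under Uniform[0,1]."""
    return sum((right - left for left, right in pieces), Fraction(0))

pieces = shift_interval(Fraction(1, 2), Fraction(3, 4), Fraction(1, 3))
print(pieces, total_length(pieces))
```

Here [1/2, 3/4] shifted by 1/3 splits into [5/6, 1] and (0, 1/12], whose lengths 1/6 and 1/12 again add up to 1/4.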


So far so good. But now suppose we ask, what is the probability that X 
is rational? What is the probability that $X^n$ is rational for some positive
integer n? What is the probability that X is algebraic, i.e. the solution to 
some polynomial equation with integer coefficients? Can we compute these 
things? More fundamentally, are all probabilities such as these necessarily 
even defined? That is, does P(A) (i.e., the probability that X lies in the 
subset A) even make sense for every possible subset A C [0, 1]? 

It turns out that the answer to this last question is no, as the following 
proposition shows. The proof requires equivalence relations, but can be 
skipped if desired since the result is not used elsewhere in this book. 


Proposition 1.2.6. There does not exist a definition of P(A), defined 
for all subsets $A \subseteq [0,1]$, satisfying (1.2.1) and (1.2.3) and (1.2.5).


Proof (optional). Suppose, to the contrary, that P(A) could be so 
defined for each subset $A \subseteq [0,1]$. We will derive a contradiction to this.

Define an equivalence relation (see Subsection A.5) on [0,1] by: $x \sim y$
if and only if the difference $y - x$ is rational. This relation partitions the
interval [0, 1] into a disjoint union of equivalence classes. Let H be a subset
of [0, 1] consisting of precisely one element from each equivalence class (such
H must exist by the Axiom of Choice, see page 200). For definiteness,
assume that $0 \notin H$ (say, if $0 \in H$, then replace it by 1/2).

Now, since H contains an element of each equivalence class, we see that
each point in (0,1] is contained in the union $\bigcup_{r \in [0,1),\ r \text{ rational}} (H \oplus r)$ of shifts of H.
Furthermore, since H contains just one point from each equivalence class,
we see that these sets $H \oplus r$, for rational $r \in [0, 1)$, are all disjoint.
But then, by countable additivity (1.2.3), we have

$$P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H \oplus r).$$

Shift-invariance (1.2.5) implies that $P(H \oplus r) = P(H)$, whence

$$1 = P((0,1]) = \sum_{r \in [0,1),\ r \text{ rational}} P(H).$$

This leads to the desired contradiction: A countably infinite sum of the same
quantity repeated can only equal 0, or $\infty$, or $-\infty$, but it can never equal 1. $\blacksquare$


This proposition says that if we want our probabilities to satisfy
reasonable[1] properties, then we cannot define them for all possible subsets of
[0, 1]. Rather, we must restrict their definition to certain “measurable” sets.
This is the motivation for the next section. 


Remark. The existence of problematic sets like H above turns out to be 
equivalent to the Axiom of Choice. In particular, we can never define such 
sets explicitly — only implicitly via the Axiom of Choice as in the above 
proof. 


1.3. Exercises. 


Exercise 1.3.1. Suppose that $\Omega = \{1, 2\}$, with $P(\emptyset) = 0$ and $P\{1, 2\} = 1$.
Suppose $P\{1\} = 1/4$. Prove that P is countably additive if and only if
$P\{2\} = 3/4$.


Exercise 1.3.2. Suppose $\Omega = \{1, 2, 3\}$ and $\mathcal{F}$ is the collection of all
subsets of $\Omega$. Find (with proof) necessary and sufficient conditions on the real
numbers x, y, and z, such that there exists a countably additive probability
measure P on $\mathcal{F}$, with $x = P\{1, 2\}$, $y = P\{2, 3\}$, and $z = P\{1, 3\}$.


Exercise 1.3.3. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and
P is defined for all $A \subseteq \Omega$ by P(A) = 0 if A is finite, and P(A) = 1 if A is
infinite. Is P finitely additive?


[1] In fact, assuming the Continuum Hypothesis, Proposition 1.2.6 continues to hold if
we require only (1.2.3) and that $0 < P([0,1]) < \infty$ and $P\{x\} = 0$ for all x; see e.g.
Billingsley (1995, p. 46).


Exercise 1.3.4. Suppose that $\Omega = \mathbf{N}$, and P is defined for all $A \subseteq \Omega$
by P(A) = |A| if A is finite (where |A| is the number of elements in
the subset A), and $P(A) = \infty$ if A is infinite. This P is of course not a
probability measure (in fact it is counting measure), however we can still
ask the following. (By convention, $\infty + \infty = \infty$.)

(a) Is P finitely additive? 

(b) Is P countably additive? 


Exercise 1.3.5. (a) In what step of the proof of Proposition 1.2.6 
was (1.2.1) used? 

(b) Give an example of a countably additive set function P, defined on all
subsets of [0,1], which satisfies (1.2.3) and (1.2.5), but not (1.2.1). 


1.4. Section summary. 


In this section, we have discussed why measure theory is necessary to 
develop a mathematically rigorous theory of probability. We have discussed
basic properties of probability measures such as additivity. We have consid- 
ered the possibility of random variables which are neither absolutely con- 
tinuous nor discrete, and therefore do not fit easily into undergraduate-level 
understanding of probability. Finally, we have proved that, for the uniform 
distribution on [0,1], it will not be possible to define a probability on every 
single subset. 
2. Probability triples.


In this section we consider probability triples and how to construct them. 
In light of the previous section, we see that to study probability theory prop- 
erly, it will be necessary to keep track of which subsets A have a probability 
P(A) defined for them. 


2.1. Basic definition. 


We define a probability triple or (probability) measure space or probability 
space to be a triple $(\Omega, \mathcal{F}, P)$, where:

• the sample space $\Omega$ is any non-empty set (e.g. $\Omega = [0,1]$ for the uniform
distribution considered above);

• the $\sigma$-algebra (read “sigma-algebra”) or $\sigma$-field (read “sigma-field”) $\mathcal{F}$
is a collection of subsets of $\Omega$, containing $\Omega$ itself and the empty set $\emptyset$,
and closed under the formation of complements[2] and countable unions
and countable intersections (e.g. for the uniform distribution considered
above, $\mathcal{F}$ would certainly contain all the intervals [a,b], but would
contain many more subsets besides);

• the probability measure P is a mapping from $\mathcal{F}$ to [0,1], with $P(\emptyset) = 0$
and $P(\Omega) = 1$, such that P is countably additive as in (1.2.3).


This definition will be in constant use throughout the text. Further- 
more it contains a number of subtle points. Thus, we pause to make a few 
additional observations. 

The $\sigma$-algebra $\mathcal{F}$ is the collection of all events or measurable sets. These
are the subsets $A \subseteq \Omega$ for which P(A) is well-defined. We know from
Proposition 1.2.6 that in general $\mathcal{F}$ might not contain all subsets of $\Omega$, though we
still expect it to contain most of the subsets that come up naturally.

To say that $\mathcal{F}$ is closed under the formation of complements and
countable unions and countable intersections means, more precisely, that
(i) For any subset $A \subseteq \Omega$, if $A \in \mathcal{F}$, then $A^C \in \mathcal{F}$;

(ii) For any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if
$A_i \in \mathcal{F}$ for each i, then the union $A_1 \cup A_2 \cup A_3 \cup \cdots \in \mathcal{F}$;

(iii) For any countable (or finite) collection of subsets $A_1, A_2, A_3, \ldots \subseteq \Omega$, if
$A_i \in \mathcal{F}$ for each i, then the intersection $A_1 \cap A_2 \cap A_3 \cap \cdots \in \mathcal{F}$.
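
When the sample space is finite, conditions (i) to (iii) can be checked mechanically, since a finite set has only finitely many subsets and so countable unions and intersections reduce to finite ones. The following is a small sketch under that finiteness assumption; the function name and the example families are our own.

```python
from itertools import combinations

def is_sigma_algebra(omega, family):
    """Check the sigma-algebra axioms for a family of subsets of a FINITE set omega.

    On a finite omega, closure under pairwise unions and intersections already
    gives closure under countable unions and intersections.
    """
    omega = frozenset(omega)
    family = {frozenset(s) for s in family}
    if omega not in family or frozenset() not in family:
        return False
    if any((omega - a) not in family for a in family):          # condition (i)
        return False
    return all(((a | b) in family) and ((a & b) in family)      # conditions (ii), (iii)
               for a, b in combinations(family, 2))

omega = {1, 2, 3}
good = [set(), {1}, {2, 3}, {1, 2, 3}]
bad = [set(), {1}, {2}, {1, 2, 3}]      # the complement of {1} is missing
print(is_sigma_algebra(omega, good), is_sigma_algebra(omega, bad))  # True False
```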

As with countable additivity, the reason we require $\mathcal{F}$ to be closed under
countable operations is to allow for taking limits, etc., when studying
probability theory. Also as with additivity, we cannot extend the definition to
require that $\mathcal{F}$ be closed under uncountable unions; in this case, for the
example of Subsection 1.2 above, $\mathcal{F}$ would contain every subset A, since every
subset can be written as $A = \bigcup_{x \in A} \{x\}$ and since the singleton sets $\{x\}$ are
all in $\mathcal{F}$.

[2] For the definitions of complements, unions, intersections, etc., see Subsection A.1 on
page 199.

There is some redundancy in the definition above. For example, it
follows from de Morgan’s Laws (see Subsection A.1) that if $\mathcal{F}$ is closed under
complement and countable unions, then it is automatically closed under
countable intersections. Similarly, it follows from countable additivity that
we must have $P(\emptyset) = 0$, and that (once we know that $P(\Omega) = 1$ and
$P(A) \ge 0$ for all $A \in \mathcal{F}$) we must have $P(A) \le 1$.

More generally, from additivity we have $P(A) + P(A^C) = P(A \cup A^C) = P(\Omega) = 1$,
whence

$$P(A^C) = 1 - P(A), \tag{2.1.1}$$

a fact that will be used often. Similarly, if $A \subseteq B$, then since $B = A \mathbin{\dot\cup} (B \setminus A)$
(where $\dot\cup$ means disjoint union), we have that $P(B) = P(A) + P(B \setminus A) \ge P(A)$, i.e.

$$P(A) \le P(B) \quad \text{whenever } A \subseteq B, \tag{2.1.2}$$


which is the monotonicity property of probability measures. 
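Also, if $A, B \in \mathcal{F}$, then

$$\begin{aligned}
P(A \cup B) &= P[(A \setminus B) \mathbin{\dot\cup} (B \setminus A) \mathbin{\dot\cup} (A \cap B)] \\
&= P(A \setminus B) + P(B \setminus A) + P(A \cap B) \\
&= P(A) - P(A \cap B) + P(B) - P(A \cap B) + P(A \cap B) \\
&= P(A) + P(B) - P(A \cap B),
\end{aligned}$$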


the principle of inclusion-exclusion. For a generalisation see Exercise 4.5.7. 
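Finally, for any sequence $A_1, A_2, \ldots \in \mathcal{F}$ (whether disjoint or not), we
have by countable additivity and monotonicity that

$$\begin{aligned}
P(A_1 \cup A_2 \cup A_3 \cup \cdots) &= P[A_1 \mathbin{\dot\cup} (A_2 \setminus A_1) \mathbin{\dot\cup} (A_3 \setminus A_2 \setminus A_1) \mathbin{\dot\cup} \cdots] \\
&= P(A_1) + P(A_2 \setminus A_1) + P(A_3 \setminus A_2 \setminus A_1) + \cdots \\
&\le P(A_1) + P(A_2) + P(A_3) + \cdots,
\end{aligned}$$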


which is the countable subadditivity property of probability measures. 
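
These identities are easy to sanity-check on a small example. The sketch below assumes, purely for illustration, a fair six-sided die, so that each event has probability equal to its size divided by 6; the events A, B, and C are arbitrary choices of ours.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))          # outcomes of a fair die (illustrative assumption)

def P(event):
    """P(A) = |A| / |OMEGA| for any subset A of OMEGA."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))

A = frozenset({1, 2, 3})
B = frozenset({3, 4})
C = frozenset({1, 2})

checks = {
    "complement (2.1.1)": P(OMEGA - A) == 1 - P(A),
    "monotonicity (2.1.2)": C <= A and P(C) <= P(A),
    "inclusion-exclusion": P(A | B) == P(A) + P(B) - P(A & B),
    "subadditivity": P(A | B | C) <= P(A) + P(B) + P(C),
}
for name, holds in checks.items():
    print(name, holds)
```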


2.2. Constructing probability triples. 


We clarify the definition of Subsection 2.1 with a simple example. Let us 
again consider the Poisson(5) distribution considered in Subsection 1.1. In 
this case, the sample space $\Omega$ would consist of all the non-negative integers:
$\Omega = \{0, 1, 2, \ldots\}$. Also, the $\sigma$-algebra $\mathcal{F}$ would consist of all subsets of $\Omega$.
Finally, the probability measure P would be defined, for any $A \in \mathcal{F}$, by

$$P(A) = \sum_{k \in A} e^{-5} 5^k / k!.$$

It is straightforward to check that $\mathcal{F}$ is indeed a $\sigma$-algebra (it contains
all subsets of $\Omega$, so it’s closed under any set operations), and that P is a
probability measure defined on $\mathcal{F}$ (the additivity following since if A and B
are disjoint, then $\sum_{k \in A \cup B}$ is the same as $\sum_{k \in A} + \sum_{k \in B}$).
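
To make this concrete, the following sketch evaluates the Poisson(5) measure on the even and the odd non-negative integers. Both sets are truncated below 200, which is an assumption of the sketch (the neglected tail is negligible). Since the sum of 5^k / k! over even k equals cosh(5), the probability of the even integers should be (1 + e^{-10})/2, and by additivity the two probabilities should add up to 1.

```python
import math

def poisson5_prob(event, lam=5.0):
    """P(A) = sum over k in A of e^{-lam} lam^k / k!, for a finite collection A
    of non-negative integers. Logarithms are used to avoid overflow."""
    return sum(math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
               for k in set(event))

evens = range(0, 200, 2)    # truncated stand-in for {0, 2, 4, ...}
odds = range(1, 200, 2)     # truncated stand-in for {1, 3, 5, ...}

p_even = poisson5_prob(evens)
p_odd = poisson5_prob(odds)
print(p_even, p_odd, p_even + p_odd)
```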

So in the case of Poisson(5), we see that it is entirely straightforward 
to construct an appropriate probability triple. The construction is similarly 
straightforward for any discrete probability space, i.e. any space for which 
the sample space $\Omega$ is finite or countable. We record this as follows.


Theorem 2.2.1. Let $\Omega$ be a finite or countable non-empty set. Let
$p : \Omega \to [0,1]$ be any function satisfying $\sum_{\omega \in \Omega} p(\omega) = 1$. Then there is a
valid probability triple $(\Omega, \mathcal{F}, P)$ where $\mathcal{F}$ is the collection of all subsets of
$\Omega$, and for $A \in \mathcal{F}$, $P(A) = \sum_{\omega \in A} p(\omega)$.


Example 2.2.2. Let $\Omega$ be any finite non-empty set, $\mathcal{F}$ be the collection
of all subsets of $\Omega$, and $P(A) = |A| / |\Omega|$ for all $A \in \mathcal{F}$ (where |A| is the
cardinality of the set A). Then $(\Omega, \mathcal{F}, P)$ is a valid probability triple, called
the uniform distribution on $\Omega$, written Uniform($\Omega$).
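
For a finite sample space, the construction of Theorem 2.2.1 can be sketched in a few lines. The helper below is hypothetical (its name and the dictionary representation of the weight function p are our own choices): it takes the weights and returns the set function P defined on all subsets.

```python
from fractions import Fraction

def make_probability_measure(weights):
    """Given weights p(w) >= 0 summing to 1 on a finite sample space,
    return the function P(A) = sum of p(w) over w in A."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if sum(weights.values()) != 1:
        raise ValueError("weights must sum to 1")

    def P(event):
        return sum((weights[w] for w in set(event)), Fraction(0))

    return P

omega = ["a", "b", "c", "d"]
# Uniform distribution of Example 2.2.2: every point has weight 1/|omega|.
uniform = make_probability_measure({w: Fraction(1, len(omega)) for w in omega})
# A non-uniform choice of weights, also covered by Theorem 2.2.1.
loaded = make_probability_measure(
    {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(0)}
)
print(uniform({"a", "b"}), loaded({"a", "b"}))  # 1/2 3/4
```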


However, if the sample space is not countable, then the situation is 
considerably more complex, as seen in Subsection 1.2. How can we
formally define a probability triple $(\Omega, \mathcal{F}, P)$ which corresponds to, say, the
Uniform[0, 1] distribution?

It seems clear that we should choose $\Omega = [0,1]$. But what about $\mathcal{F}$? We
know from Proposition 1.2.6 that $\mathcal{F}$ cannot contain all subsets of $\Omega$, but it
should certainly contain all the intervals [a,b], [a, b), etc. That is, we must
have $\mathcal{F} \supseteq \mathcal{J}$, where

$$\mathcal{J} = \{\text{all intervals contained in } [0, 1]\}$$

and where “intervals” is understood to include all the open / closed /
half-open / singleton / empty intervals.


Exercise 2.2.3. Prove that the above collection $\mathcal{J}$ is a semialgebra of
subsets of $\Omega$, meaning that it contains $\emptyset$ and $\Omega$, it is closed under finite
intersection, and the complement of any element of $\mathcal{J}$ is equal to a finite
disjoint union of elements of $\mathcal{J}$.
Since $\mathcal{J}$ is only a semialgebra, how can we create a $\sigma$-algebra? As a first
try, we might consider

$$\mathcal{B}_0 = \{\text{all finite unions of elements of } \mathcal{J}\}. \tag{2.2.4}$$

(After all, by additivity we already know how to define P on $\mathcal{B}_0$.) However,
$\mathcal{B}_0$ is not a $\sigma$-algebra:


Exercise 2.2.5. (a) Prove that $\mathcal{B}_0$ is an algebra (or, field) of subsets
of $\Omega$, meaning that it contains $\Omega$ and $\emptyset$, and is closed under the formation
of complements and of finite unions and intersections.

(b) Prove that $\mathcal{B}_0$ is not a $\sigma$-algebra.


As a second try, we might consider

$$\mathcal{B}_1 = \{\text{all finite or countable unions of elements of } \mathcal{J}\}. \tag{2.2.6}$$

Unfortunately, $\mathcal{B}_1$ is still not a $\sigma$-algebra (Exercise 2.4.7).

Thus, the construction of F, and of P, presents serious challenges. To 
deal with them, we next prove a very general theorem about constructing 
probability triples. 


2.3. The Extension Theorem. 


The following theorem is of fundamental importance in constructing 
complicated probability triples. Recall the definition of semialgebra from 
Exercise 2.2.3. 


Theorem 2.3.1. (The Extension Theorem.) Let $\mathcal{J}$ be a semialgebra of
subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying
the finite superadditivity property that

$$P\Bigl(\bigcup_{i=1}^{k} A_i\Bigr) \ge \sum_{i=1}^{k} P(A_i) \quad \text{whenever } A_1, \ldots, A_k \in \mathcal{J}, \text{ and } \bigcup_{i=1}^{k} A_i \in \mathcal{J}, \text{ and the } \{A_i\} \text{ are disjoint}, \tag{2.3.2}$$

and also the countable monotonicity property that

$$P(A) \le \sum_{n} P(A_n) \quad \text{for } A, A_1, A_2, \ldots \in \mathcal{J} \text{ with } A \subseteq \bigcup_{n} A_n. \tag{2.3.3}$$

Then there is a $\sigma$-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability
measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$. (That is,
$(\Omega, \mathcal{M}, P^*)$ is a valid probability triple, which agrees with our previous
probabilities on $\mathcal{J}$.)


Remark. Of course, the conclusions of Theorem 2.3.1 imply that (2.3.2) 
must actually hold with equality. However, (2.3.2) need only be verified as 
an inequality to apply Theorem 2.3.1. 


Theorem 2.3.1 provides precisely what we need: a way to construct 
complicated probability triples on a full $\sigma$-algebra, using only probabilities
defined on the much simpler subsets (e.g., intervals) in J. 

However, it is not clear how to even start proving this theorem. Indeed, 
how could we begin to define P(A) for all A in a $\sigma$-algebra? The key is
given by outer measure $P^*$, defined by

$$P^*(A) = \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_{i} P(A_i), \qquad A \subseteq \Omega. \tag{2.3.4}$$

That is, we define $P^*(A)$, for any subset $A \subseteq \Omega$, to be the infimum of
sums of $P(A_i)$, where $\{A_i\}$ is any countable collection of elements of the
original semialgebra $\mathcal{J}$ whose union contains A. In other words, we use
the values of P(A) for $A \in \mathcal{J}$, to help us define $P^*(A)$ for any $A \subseteq \Omega$.
Of course, we know that $P^*$ will not necessarily be a proper probability
measure for all $A \subseteq \Omega$; for example, this is not possible for Uniform[0, 1]
by Proposition 1.2.6. However, it is still useful that $P^*(A)$ is at least defined
for all $A \subseteq \Omega$. We shall eventually show that $P^*$ is indeed a probability
measure on some $\sigma$-algebra $\mathcal{M}$, and that $P^*$ is an extension of P.
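
On a finite toy example, the infimum in (2.3.4) can be computed by brute force. The sketch below is only an illustration under assumptions of our own: the sample space {1, 2, 3, 4}, the semialgebra consisting of the empty set, {1, 2}, {3, 4}, and the whole space, and the values 1/3 and 2/3 are all hypothetical. Since this semialgebra is finite, and repeating a set in a cover can only increase the sum, it suffices to search over its finite subcollections.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})
# Hypothetical set function on a small semialgebra J (the keys of the dict).
P_ON_J = {
    frozenset(): Fraction(0),
    frozenset({1, 2}): Fraction(1, 3),
    frozenset({3, 4}): Fraction(2, 3),
    OMEGA: Fraction(1),
}

def outer_measure(a):
    """Brute-force version of (2.3.4): minimise the total P over all
    subcollections of J whose union contains the set a."""
    a = frozenset(a)
    sets = list(P_ON_J)
    best = None
    for size in range(len(sets) + 1):
        for cover in combinations(sets, size):
            if a <= frozenset().union(*cover):
                total = sum((P_ON_J[s] for s in cover), Fraction(0))
                if best is None or total < best:
                    best = total
    return best

for a in [set(), {1}, {3}, {1, 3}, {1, 2}]:
    print(sorted(a), outer_measure(a))
```

For instance, the cheapest cover of {1} is {1, 2}, so its outer measure is 1/3, while {1, 3} needs both blocks (or the whole space) and has outer measure 1.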

To continue, we note a few simple properties of $P^*$. Firstly, we clearly
have $P^*(\emptyset) = 0$; indeed, we can simply take $A_i = \emptyset$ for each i in the
definition (2.3.4). Secondly, $P^*$ is clearly monotone; indeed, if $A \subseteq B$ then
the infimum (2.3.4) for $P^*(A)$ includes all choices of $\{A_i\}$ which work for
$P^*(B)$ plus many more besides, so that $P^*(A) \le P^*(B)$. We also have:


Lemma 2.3.5. $P^*$ is an extension of P, i.e. $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. Let $A \in \mathcal{J}$. It follows from (2.3.3) that $P^*(A) \ge P(A)$. On the
other hand, choosing $A_1 = A$ and $A_i = \emptyset$ for $i > 1$ in the definition (2.3.4)
shows by (A.4.1) that $P^*(A) \le P(A)$. $\blacksquare$


Lemma 2.3.6. $P^*$ is countably subadditive, i.e.

$$P^*\Bigl(\bigcup_{n=1}^{\infty} B_n\Bigr) \le \sum_{n=1}^{\infty} P^*(B_n) \quad \text{for any } B_1, B_2, \ldots \subseteq \Omega.$$

Proof. Let $B_1, B_2, \ldots \subseteq \Omega$. From the definition (2.3.4), we see that for
any $\epsilon > 0$, we can find (cf. Proposition A.4.2) a collection $\{C_{nk}\}_{k=1}^{\infty}$ for each
$n \in \mathbf{N}$, with $C_{nk} \in \mathcal{J}$, such that $B_n \subseteq \bigcup_k C_{nk}$ and $\sum_k P(C_{nk}) \le P^*(B_n) + \epsilon 2^{-n}$.
But then the overall collection $\{C_{nk}\}_{n,k=1}^{\infty}$ contains $\bigcup_{n=1}^{\infty} B_n$. It
follows that $P^*(\bigcup_{n=1}^{\infty} B_n) \le \sum_{n,k} P(C_{nk}) \le \sum_n P^*(B_n) + \epsilon$. Since this is
true for any $\epsilon > 0$, we must have (cf. Proposition A.3.1) that $P^*(\bigcup_{n=1}^{\infty} B_n) \le \sum_n P^*(B_n)$,
as claimed. $\blacksquare$


We now set 

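$$\mathcal{M} = \{A \subseteq \Omega \,;\, P^*(A \cap E) + P^*(A^C \cap E) = P^*(E) \ \ \forall E \subseteq \Omega\}. \tag{2.3.7}$$

That is, $\mathcal{M}$ is the set of all subsets A with the property that $P^*$ is additive
on the union of $A \cap E$ with $A^C \cap E$, for all subsets E. Note that by
subadditivity we always have $P^*(A \cap E) + P^*(A^C \cap E) \ge P^*(E)$, so (2.3.7)
is equivalent to

$$\mathcal{M} = \{A \subseteq \Omega \,;\, P^*(A \cap E) + P^*(A^C \cap E) \le P^*(E) \ \ \forall E \subseteq \Omega\}, \tag{2.3.8}$$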


which is sometimes helpful. Furthermore, P* is countably additive on M: 


Lemma 2.3.9. If $A_1, A_2, \ldots \in \mathcal{M}$ are disjoint, then $P^*(\bigcup_n A_n) = \sum_n P^*(A_n)$.


Proof. If $A_1$ and $A_2$ are disjoint, with $A_1 \in \mathcal{M}$, then

$$\begin{aligned}
P^*(A_1 \cup A_2) &= P^*(A_1 \cap (A_1 \cup A_2)) + P^*(A_1^C \cap (A_1 \cup A_2)) && \text{since } A_1 \in \mathcal{M} \\
&= P^*(A_1) + P^*(A_2) && \text{since } A_1, A_2 \text{ disjoint}.
\end{aligned}$$

Hence, by induction, the lemma holds for any finite collection of $A_i$.
Then, with countably many disjoint $A_i \in \mathcal{M}$, we see that for any $m \in \mathbf{N}$,

$$\sum_{n \le m} P^*(A_n) = P^*\Bigl(\bigcup_{n \le m} A_n\Bigr) \le P^*\Bigl(\bigcup_{n} A_n\Bigr),$$

where the inequality follows from monotonicity. Since this is true for any
$m \in \mathbf{N}$, we have (cf. Proposition A.3.6) that $\sum_n P^*(A_n) \le P^*(\bigcup_n A_n)$.
On the other hand, by subadditivity we have $\sum_n P^*(A_n) \ge P^*(\bigcup_n A_n)$.
Hence, the lemma holds for countably many $A_i$ as well. $\blacksquare$


The plan now is to show that $\mathcal{M}$ is a σ-algebra which contains $\mathcal{J}$. We
break up the proof into a number of lemmas.

Lemma 2.3.10. $\mathcal{M}$ is an algebra, i.e. $\Omega \in \mathcal{M}$ and $\mathcal{M}$ is closed under
complement and finite intersection (and hence also finite union).

Proof. It is immediate from (2.3.7) that $\mathcal{M}$ contains $\Omega$ and is closed
under complement. For the statement about finite intersections, suppose
$A, B \in \mathcal{M}$. Then, for any $E \subseteq \Omega$, using subadditivity,

$$\begin{aligned}
&P^*\big((A \cap B) \cap E\big) + P^*\big((A \cap B)^C \cap E\big) \\
&\quad= P^*(A \cap B \cap E) + P^*\big((A^C \cap B \cap E) \cup (A \cap B^C \cap E) \cup (A^C \cap B^C \cap E)\big) \\
&\quad\le P^*(A \cap B \cap E) + P^*(A^C \cap B \cap E) + P^*(A \cap B^C \cap E) + P^*(A^C \cap B^C \cap E) \\
&\quad= P^*(B \cap E) + P^*(B^C \cap E) && \text{since } A \in \mathcal{M} \\
&\quad= P^*(E) && \text{since } B \in \mathcal{M}.
\end{aligned}$$

Hence, by (2.3.8), $A \cap B \in \mathcal{M}$. ∎


Lemma 2.3.11. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. For each $m \in \mathbf{N}$, let
$B_m = \bigcup_{n \le m} A_n$. Then for all $m \in \mathbf{N}$, and for all $E \subseteq \Omega$, we have

$$P^*(E \cap B_m) = \sum_{n \le m} P^*(E \cap A_n). \qquad (2.3.12)$$

Proof. We use induction on $m$. Indeed, the statement is trivially true
when $m = 1$. Assuming it true for some particular value of $m$, and noting
that $B_m \cap B_{m+1} = B_m$ and $B_m^C \cap B_{m+1} = A_{m+1}$, we have (noting that
$B_m \in \mathcal{M}$ by Lemma 2.3.10) that

$$\begin{aligned}
P^*(E \cap B_{m+1}) &= P^*(B_m \cap E \cap B_{m+1}) + P^*(B_m^C \cap E \cap B_{m+1}) && \text{since } B_m \in \mathcal{M} \\
&= P^*(E \cap B_m) + P^*(E \cap A_{m+1}) \\
&= \sum_{n \le m+1} P^*(E \cap A_n) && \text{by the induction hypothesis,}
\end{aligned}$$

thus completing the induction proof. ∎


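Lemma 2.3.13. Let $A_1, A_2, \ldots \in \mathcal{M}$ be disjoint. Then $\bigcup_n A_n \in \mathcal{M}$.

Proof. For each $m \in \mathbf{N}$, let $B_m = \bigcup_{n \le m} A_n$. Then for any $m \in \mathbf{N}$ and
any $E \subseteq \Omega$,

$$\begin{aligned}
P^*(E) &= P^*(E \cap B_m) + P^*(E \cap B_m^C) && \text{since } B_m \in \mathcal{M} \\
&= \sum_{n \le m} P^*(E \cap A_n) + P^*(E \cap B_m^C) && \text{by (2.3.12)} \\
&\ge \sum_{n \le m} P^*(E \cap A_n) + P^*\Big(E \cap \big(\textstyle\bigcup_n A_n\big)^C\Big),
\end{aligned}$$

where the inequality follows by monotonicity since $\big(\bigcup_n A_n\big)^C \subseteq B_m^C$. This is
true for any $m \in \mathbf{N}$, so it implies (cf. Proposition A.3.6) that

$$\begin{aligned}
P^*(E) &\ge \sum_n P^*(E \cap A_n) + P^*\Big(E \cap \big(\textstyle\bigcup_n A_n\big)^C\Big) \\
&\ge P^*\Big(E \cap \big(\textstyle\bigcup_n A_n\big)\Big) + P^*\Big(E \cap \big(\textstyle\bigcup_n A_n\big)^C\Big),
\end{aligned}$$

where the final inequality follows by subadditivity. Hence, by (2.3.8) we
have $\bigcup_n A_n \in \mathcal{M}$. ∎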


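Lemma 2.3.14. $\mathcal{M}$ is a σ-algebra.


Proof. In light of Lemma 2.3.10, it suffices to show that $\bigcup_n A_n \in \mathcal{M}$
whenever $A_1, A_2, \ldots \in \mathcal{M}$. Let $D_1 = A_1$, and $D_i = A_i \cap A_1^C \cap \ldots \cap A_{i-1}^C$
for $i \ge 2$. Then $\{D_i\}$ are disjoint, with $\bigcup_i D_i = \bigcup_i A_i$, and with $D_i \in \mathcal{M}$
by Lemma 2.3.10. Hence, by Lemma 2.3.13, $\bigcup_i D_i \in \mathcal{M}$, i.e. $\bigcup_i A_i \in \mathcal{M}$. ∎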


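Lemma 2.3.15. $\mathcal{J} \subseteq \mathcal{M}$.

Proof. Let $A \in \mathcal{J}$. Then since $\mathcal{J}$ is a semialgebra, we can write
$A^C = J_1 \,\dot\cup\, \ldots \,\dot\cup\, J_k$ for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Also, for any $E \subseteq \Omega$ and
$\epsilon > 0$, by the definition (2.3.4) we can find (cf. Proposition A.4.2) $A_1, A_2, \ldots \in \mathcal{J}$
with $E \subseteq \bigcup_n A_n$ and $\sum_n P(A_n) \le P^*(E) + \epsilon$. Then

$$\begin{aligned}
&P^*(E \cap A) + P^*(E \cap A^C) \\
&\quad\le P^*\big((\textstyle\bigcup_n A_n) \cap A\big) + P^*\big((\textstyle\bigcup_n A_n) \cap A^C\big) && \text{by monotonicity} \\
&\quad= P^*\big(\textstyle\bigcup_n (A_n \cap A)\big) + P^*\big(\textstyle\bigcup_n \bigcup_{i=1}^k (A_n \cap J_i)\big) \\
&\quad\le \sum_n P^*(A_n \cap A) + \sum_n \sum_{i=1}^k P^*(A_n \cap J_i) && \text{by subadditivity} \\
&\quad= \sum_n P(A_n \cap A) + \sum_n \sum_{i=1}^k P(A_n \cap J_i) && \text{since } P^* = P \text{ on } \mathcal{J} \\
&\quad= \sum_n \Big(P(A_n \cap A) + \sum_{i=1}^k P(A_n \cap J_i)\Big) \\
&\quad\le \sum_n P(A_n) && \text{by (2.3.2)} \\
&\quad\le P^*(E) + \epsilon && \text{by assumption.}
\end{aligned}$$

This is true for any $\epsilon > 0$, hence (cf. Proposition A.3.1) we have
$P^*(E \cap A) + P^*(E \cap A^C) \le P^*(E)$, for any $E \subseteq \Omega$. Hence, from (2.3.8), we have
$A \in \mathcal{M}$. This holds for any $A \in \mathcal{J}$, hence $\mathcal{J} \subseteq \mathcal{M}$. ∎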


With all those lemmas behind us, we are now, finally, able to complete
the proof of the Extension Theorem.

Proof of Theorem 2.3.1. Lemmas 2.3.5, 2.3.9, 2.3.14, and 2.3.15 together
show that $\mathcal{M}$ is a σ-algebra containing $\mathcal{J}$, that $P^*$ is a probability
measure on $\mathcal{M}$, and that $P^*$ is an extension of $P$. ∎
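
As an illustrative aside, the construction of $P^*$ and $\mathcal{M}$ can be carried out by brute force on a small finite example. The sketch below is not part of the proof; the sample space, the semialgebra `P.keys()`, and the values of `P` are hypothetical choices made only for illustration. Because this semialgebra is finite, the countable covers in (2.3.4) can be replaced by sub-collections of it without changing the infimum.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset({1, 2, 3, 4})

# Hypothetical semialgebra J (the keys) and set function P (the values).
P = {
    frozenset(): Fraction(0),
    frozenset({1, 2}): Fraction(1, 3),
    frozenset({3, 4}): Fraction(2, 3),
    OMEGA: Fraction(1),
}


def subsets(s):
    """All subsets of the finite set s, as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def outer_measure(a):
    """P*(A) as in (2.3.4): least total P-mass of a cover of A by sets in J."""
    members = list(P)
    best = P[OMEGA]  # the single set Omega always covers A
    for r in range(1, len(members) + 1):
        for cover in combinations(members, r):
            if a <= frozenset().union(*cover):
                best = min(best, sum(P[c] for c in cover))
    return best


def in_M(a):
    """Check the defining condition (2.3.7) against every test set E."""
    complement = OMEGA - a
    return all(
        outer_measure(a & e) + outer_measure(complement & e) == outer_measure(e)
        for e in subsets(OMEGA)
    )


M = [a for a in subsets(OMEGA) if in_M(a)]
```

In this example `M` turns out to consist of exactly the four sets of the semialgebra itself, and `outer_measure` agrees with `P` on them, as Lemma 2.3.5 predicts.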


Exercise 2.3.16. Prove that the extension $(\Omega, \mathcal{M}, P^*)$ constructed in
the proof of Theorem 2.3.1 must be complete, meaning that if $A \in \mathcal{M}$ with
$P^*(A) = 0$, and if $B \subseteq A$, then $B \in \mathcal{M}$. (It then follows from monotonicity
that $P^*(B) = 0$.)


2.4. Constructing the Uniform[0, 1] distribution.


Theorem 2.3.1 allows us to automatically construct valid probability
triples which take particular values on particular sets. We now use this to
construct the Uniform[0, 1] distribution. We begin by letting $\Omega = [0,1]$,
and again setting


$$\mathcal{J} = \{\text{all intervals contained in } [0,1]\}, \qquad (2.4.1)$$


where again “intervals” is understood to include all the open, closed, half-open,
and singleton intervals contained in [0,1], and also the empty set $\emptyset$.
Then $\mathcal{J}$ is a semialgebra by Exercise 2.2.3.

For $I \in \mathcal{J}$, we let $P(I)$ be the length of $I$. Thus $P(\emptyset) = 0$ and $P(\Omega) = 1$.
We now proceed to verify (2.3.2) and (2.3.3).


Proposition 2.4.2. The above definition of $\mathcal{J}$ and $P$ satisfies (2.3.2),
with equality.


Proof. Let $I_1, \ldots, I_k$ be disjoint intervals contained in [0,1], whose union
is some interval $I_0$. For $0 \le j \le k$, write $a_j$ for the left end-point of $I_j$, and
$b_j$ for the right end-point of $I_j$. The assumptions imply that by re-ordering,
we can ensure that $a_0 = a_1 \le b_1 = a_2 \le b_2 = a_3 \le \ldots \le b_k = b_0$. Then

$$\sum_{j=1}^k P(I_j) = \sum_{j=1}^k (b_j - a_j) = b_k - a_1 = b_0 - a_0 = P(I_0).$$ ∎


The verification of (2.3.3) for this J and P is a bit more involved: 


Exercise 2.4.3. (a) Prove that if $I_1, I_2, \ldots, I_n$ is a finite collection of
intervals, and if $\bigcup_{j=1}^n I_j \supseteq I$ for some interval $I$, then $\sum_{j=1}^n P(I_j) \ge P(I)$.
[Hint: Imitate the proof of Proposition 2.4.2.]

(b) Prove that if $I_1, I_2, \ldots$ is a countable collection of open intervals, and
if $\bigcup_{j=1}^\infty I_j \supseteq I$ for some closed interval $I$, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$. [Hint:
You may use the Heine-Borel Theorem, which says that if a collection of
open intervals contains a closed interval, then some finite sub-collection of
the open intervals also contains the closed interval.]

(c) Verify (2.3.3), i.e. prove that if $I_1, I_2, \ldots$ is any countable collection of
intervals, and if $\bigcup_{j=1}^\infty I_j \supseteq I$ for any interval $I$, then $\sum_{j=1}^\infty P(I_j) \ge P(I)$.
[Hint: Extend the interval $I_j$ by $\epsilon 2^{-j}$ at each end, and decrease $I$ by $\epsilon$ at
each end, while making $I_j$ open and $I$ closed. Then use part (b).]


In light of Proposition 2.4.2 and Exercise 2.4.3, we can apply Theo- 
rem 2.3.1 to conclude the following: 


Theorem 2.4.4. There exists a probability triple $(\Omega, \mathcal{M}, P^*)$ such that
$\Omega = [0,1]$, $\mathcal{M}$ contains all intervals in [0,1], and for any interval $I \subseteq [0,1]$,
$P^*(I)$ is the length of $I$.


This probability triple is called either the uniform distribution on [0,1], or
Lebesgue measure on [0,1]. Depending on the context, we sometimes write
the probability measure $P^*$ as $P$ or as $\lambda$.


Remark. Let $\mathcal{B} = \sigma(\mathcal{J})$ be the σ-algebra generated by $\mathcal{J}$, i.e. the smallest
σ-algebra containing $\mathcal{J}$. (The collection $\mathcal{B}$ is called the Borel σ-algebra of
subsets of [0,1], and the elements of $\mathcal{B}$ are called Borel sets.) Clearly, we
must have $\mathcal{M} \supseteq \mathcal{B}$. In this case, it can be shown that $\mathcal{M}$ is in fact much
bigger than $\mathcal{B}$; it even has larger cardinality. Furthermore, it turns out that
Lebesgue measure restricted to $\mathcal{B}$ is not complete, though on $\mathcal{M}$ it is (by
Exercise 2.3.16). In addition to the Borel subsets of [0, 1], we shall also have
occasion to refer to the Borel σ-algebra of subsets of $\mathbf{R}$, defined to be the
smallest σ-algebra of subsets of $\mathbf{R}$ which includes all intervals.


Exercise 2.4.5. Let $\mathcal{A} = \{(-\infty, x];\ x \in \mathbf{R}\}$. Prove that $\sigma(\mathcal{A}) = \mathcal{B}$, i.e.
that the smallest σ-algebra of subsets of $\mathbf{R}$ which contains $\mathcal{A}$ is equal to the
Borel σ-algebra of subsets of $\mathbf{R}$. [Hint: Does $\sigma(\mathcal{A})$ include all intervals?]


Writing $\lambda$ for Lebesgue measure on [0,1], we know that $\lambda\{x\} = 0$ for
any singleton set $\{x\}$. It follows by countable additivity that $\lambda(A) = 0$
for any set $A$ which is countable. This includes (cf. Subsection A.2) the
rational numbers, the integer roots of the rational numbers, the algebraic
numbers, etc. That is, if $X$ is uniformly distributed on [0,1], then $P(X$
is rational$) = 0$, and $P(X^n$ is rational for some $n \in \mathbf{N}) = 0$, and $P(X$ is
algebraic$) = 0$, and so on.
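
The same conclusion can be seen from the covering point of view of (2.3.4): a countable set can be covered by intervals of arbitrarily small total length. The following Python sketch illustrates this for the rationals in [0,1]; the enumeration order and the function names are assumptions made for this illustration, not something taken from the text.

```python
from fractions import Fraction
from math import gcd


def rationals_in_unit_interval():
    """Enumerate each rational in [0, 1] exactly once, by increasing denominator."""
    yield Fraction(0)
    yield Fraction(1)
    q = 2
    while True:
        for p in range(1, q):
            if gcd(p, q) == 1:
                yield Fraction(p, q)
        q += 1


def cover_length(eps, count):
    """Total length of intervals covering the first `count` rationals,
    where the k-th rational gets an interval of length eps / 2**k."""
    total = Fraction(0)
    gen = rationals_in_unit_interval()
    for k in range(1, count + 1):
        x = next(gen)
        half = Fraction(eps) / 2 ** (k + 1)
        left, right = x - half, x + half  # interval centred at x
        total += right - left
    return total
```

However many rationals are covered, the total length stays below `eps`, since the lengths form a geometric series.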

There also exist uncountable sets which have Lebesgue measure 0. The
simplest example is the Cantor set $K$, defined as follows (see Figure 2.4.6).
We begin with the interval [0,1]. We then remove the open interval consisting
of the middle third (1/3, 2/3). We then remove the open middle
thirds of each of the two pieces, i.e. we remove (1/9, 2/9) and (7/9, 8/9).
We then remove the four open middle thirds (1/27, 2/27), (7/27, 8/27),
(19/27, 20/27), and (25/27, 26/27) of the remaining pieces. We continue
inductively, at the $n$th stage removing the $2^{n-1}$ middle thirds of all remaining
sub-intervals, each of length $1/3^n$. The Cantor set $K$ is defined to be
everything that is left over, after we have removed all these middle thirds.

Figure 2.4.6. Constructing the Cantor set K.
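
The inductive construction just described can be sketched in a few lines of Python, using exact rational arithmetic. The function name `removed_intervals` is a label chosen for this illustration.

```python
from fractions import Fraction


def removed_intervals(stage):
    """Open middle thirds removed at the given stage (stage = 1, 2, 3, ...),
    returned as (left, right) end-point pairs in increasing order."""
    remaining = [(Fraction(0), Fraction(1))]
    # Split each remaining piece into its two outer thirds, stage - 1 times.
    for _ in range(stage - 1):
        pieces = []
        for a, b in remaining:
            third = (b - a) / 3
            pieces += [(a, a + third), (b - third, b)]
        remaining = pieces
    # The middle third of each remaining piece is what gets removed now.
    return [(a + (b - a) / 3, b - (b - a) / 3) for a, b in remaining]
```

For instance, `removed_intervals(2)` gives the end-points of (1/9, 2/9) and (7/9, 8/9), matching the second step above.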

Now, the complement of the Cantor set has Lebesgue measure given
by $\lambda(K^C) = 1/3 + 2(1/9) + 4(1/27) + \ldots = \sum_{n=1}^\infty 2^{n-1}/3^n = 1$. Hence,
by (2.1.1), $\lambda(K) = 1 - 1 = 0$.
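
As a quick numerical check of this geometric series, the partial sums can be computed exactly. This is only a sketch; the closed form $1 - (2/3)^N$ for the $N$th partial sum is the standard geometric-series identity, and the function name is chosen for this illustration.

```python
from fractions import Fraction


def removed_length(stages):
    """Total length removed in the first `stages` stages of the construction."""
    return sum(Fraction(2 ** (n - 1), 3 ** n) for n in range(1, stages + 1))
```

The partial sums equal $1 - (2/3)^N$, which increases to 1 as $N \to \infty$, so the length remaining after $N$ stages is $(2/3)^N$.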

On the other hand, $K$ is uncountable. Indeed, for each point $x \in K$, let
$d_n(x) = 0$ or 1 depending on whether, at the $n$th stage of the construction
of $K$, $x$ was to the left or the right of the nearest open interval removed.
Then define the function $f : K \to [0,1]$ by $f(x) = \sum_{n=1}^\infty d_n(x)\, 2^{-n}$. It is
easily checked that $f(K) = [0, 1]$, i.e. that $f$ maps $K$ onto [0,1]. Since [0, 1]
is uncountable, this means that $K$ must also be uncountable.


Remark. The Cantor set is also equal to the set of all numbers in [0, 1]
which have a base-3 expansion that does not contain the digit 1. That is,
$K = \big\{\sum_{n=1}^\infty c_n 3^{-n} :\ \text{each } c_n \in \{0, 2\}\big\}$.
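
For rational points, membership in $K$ can be decided mechanically by reading off base-3 digits one at a time. The following is a minimal sketch under that restriction (rational input only); the function name is hypothetical. It uses the self-similarity of $K$: a point of [0, 1/3] is in $K$ exactly when its image under $x \mapsto 3x$ is, and a point of [2/3, 1] is in $K$ exactly when its image under $x \mapsto 3x - 2$ is. For a rational $x$ these images keep the same denominator, so they eventually repeat and the loop terminates.

```python
from fractions import Fraction


def in_cantor_set(x):
    """Decide whether a rational number x lies in the Cantor set K."""
    x = Fraction(x)
    if not 0 <= x <= 1:
        return False
    seen = set()
    while x not in seen:
        seen.add(x)
        if x <= Fraction(1, 3):
            x = 3 * x        # digit 0: rescale [0, 1/3] onto [0, 1]
        elif x >= Fraction(2, 3):
            x = 3 * x - 2    # digit 2: rescale [2/3, 1] onto [0, 1]
        else:
            return False     # x lies in a removed open middle third
    return True              # the digits repeat forever without a forced 1
```

For example, 1/4 (base-3 expansion 0.020202...) is in $K$, while 1/2 is removed at the first stage.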


Exercise 2.4.7. (a) Prove that $K, K^C \in \mathcal{B}$, where $\mathcal{B}$ are the Borel
subsets of [0, 1].

(b) Prove that $K, K^C \in \mathcal{M}$, where $\mathcal{M}$ is the σ-algebra of Theorem 2.4.4.

(c) Prove that $K^C \in \mathcal{B}_1$, where $\mathcal{B}_1$ is defined by (2.2.6).

(d) Prove that $K \notin \mathcal{B}_1$.

(e) Prove that $\mathcal{B}_1$ is not a σ-algebra.


On the other hand, from Proposition 1.2.6 we know that:

Proposition 2.4.8. For the probability triple $(\Omega, \mathcal{M}, P^*)$ of Theorem 2.4.4
corresponding to Lebesgue measure on [0,1], there exists at least
one subset $H \subseteq \Omega$ with $H \notin \mathcal{M}$.


2.5. Extensions of the Extension Theorem. 


The Extension Theorem (Theorem 2.3.1) will be our main tool for prov- 
ing the existence of complicated probability triples. While (2.3.2) is gener- 
ally easy to verify, (2.3.3) can be more challenging. Thus, we present some 
alternative formulations here. 


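Corollary 2.5.1. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0, 1]$
with $P(\emptyset) = 0$ and $P(\Omega) = 1$, satisfying (2.3.2), and the “monotonicity
on $\mathcal{J}$” property that

$$P(A) \le P(B) \quad \text{whenever } A, B \in \mathcal{J} \text{ with } A \subseteq B, \qquad (2.5.2)$$

and also the “countable subadditivity on $\mathcal{J}$” property that

$$P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \quad \text{for } B_1, B_2, \ldots \in \mathcal{J} \text{ with } \bigcup_n B_n \in \mathcal{J}. \qquad (2.5.3)$$

Then there is a σ-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability
measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.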


Proof. In light of Theorem 2.3.1, we need only verify (2.3.3). To that
end, let $A, A_1, A_2, \ldots \in \mathcal{J}$ with $A \subseteq \bigcup_n A_n$. Set $B_n = A \cap A_n$. Then since
$A \subseteq \bigcup_n A_n$, we have $A = \bigcup_n (A \cap A_n) = \bigcup_n B_n$, whence (2.5.3) and (2.5.2)
give that

$$P(A) = P\Big(\bigcup_n B_n\Big) \le \sum_n P(B_n) \le \sum_n P(A_n).$$ ∎

Another version assumes countable additivity of $P$ on $\mathcal{J}$:


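Corollary 2.5.4. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$. Let $P : \mathcal{J} \to [0,1]$
with $P(\Omega) = 1$, satisfying the countable additivity property that

$$P\Big(\bigcup_n D_n\Big) = \sum_n P(D_n) \quad \text{for } D_1, D_2, \ldots \in \mathcal{J} \text{ disjoint with } \bigcup_n D_n \in \mathcal{J}. \qquad (2.5.5)$$

Then there is a σ-algebra $\mathcal{M} \supseteq \mathcal{J}$, and a countably additive probability
measure $P^*$ on $\mathcal{M}$, such that $P^*(A) = P(A)$ for all $A \in \mathcal{J}$.

Proof. Note that (2.5.5) immediately implies (2.3.2) (with equality), and
that $P(\emptyset) = 0$. Hence, in light of Corollary 2.5.1, we need only verify (2.5.2)
and (2.5.3).

For (2.5.2), let $A, B \in \mathcal{J}$ with $A \subseteq B$. Since $\mathcal{J}$ is a semialgebra, we
can write $A^C = J_1 \,\dot\cup\, \ldots \,\dot\cup\, J_k$, for some disjoint $J_1, \ldots, J_k \in \mathcal{J}$. Then
using (2.5.5),

$$P(B) = P(A) + P(B \cap J_1) + \ldots + P(B \cap J_k) \ge P(A).$$

For (2.5.3), let $B_1, B_2, \ldots \in \mathcal{J}$ with $\bigcup_n B_n \in \mathcal{J}$. Set $D_1 = B_1$, and
$D_n = B_n \cap B_1^C \cap \ldots \cap B_{n-1}^C$ for $n \ge 2$. Then $\{D_n\}$ are disjoint, with $\bigcup_n D_n = \bigcup_n B_n$.
Furthermore, since $\mathcal{J}$ is a semialgebra, each $D_n$ can be written as
a finite disjoint union of elements of $\mathcal{J}$, say $D_n = J_{n1} \,\dot\cup\, \ldots \,\dot\cup\, J_{nk_n}$. It then
follows from (2.5.5) and (2.5.2) that

$$\begin{aligned}
P\Big(\bigcup_n B_n\Big) &= P\Big(\bigcup_n D_n\Big) = P\Big(\bigcup_n \bigcup_{i=1}^{k_n} J_{ni}\Big) \\
&= \sum_n \sum_{i=1}^{k_n} P(J_{ni}) = \sum_n P(D_n) \le \sum_n P(B_n).
\end{aligned}$$ ∎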


Exercise 2.5.6. Suppose $P$ satisfies (2.5.5) for finite collections $\{D_n\}$.
Suppose further that, whenever $A_1, A_2, \ldots \in \mathcal{J}$ such that $A_{n+1} \subseteq A_n$
and $\bigcap_{n=1}^\infty A_n = \emptyset$, we have $\lim_{n \to \infty} P(A_n) = 0$. Prove that $P$ also satisfies
(2.5.5) for countable collections $\{D_n\}$. [Hint: Set $A_n = \big(\bigcup_{i=1}^\infty D_i\big) \setminus \big(\bigcup_{i=1}^{n} D_i\big)$.]

The extension of Theorem 2.3.1 also has a uniqueness property:


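Proposition 2.5.7. Let $\mathcal{J}$, $P$, $P^*$, and $\mathcal{M}$ be as in Theorem 2.3.1 (or as
in Corollary 2.5.1 or 2.5.4). Let $\mathcal{F}$ be any σ-algebra with $\mathcal{J} \subseteq \mathcal{F} \subseteq \mathcal{M}$ (e.g.
$\mathcal{F} = \mathcal{M}$, or $\mathcal{F} = \sigma(\mathcal{J})$). Let $Q$ be any probability measure on $\mathcal{F}$, such that
$Q(A) = P(A)$ for all $A \in \mathcal{J}$. Then $Q(A) = P^*(A)$ for all $A \in \mathcal{F}$.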


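Proof. For $A \in \mathcal{F}$, we compute

$$\begin{aligned}
P^*(A) &= \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i P(A_i) && \text{from (2.3.4)} \\
&= \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} \sum_i Q(A_i) && \text{since } Q = P \text{ on } \mathcal{J} \\
&\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q\Big(\bigcup_i A_i\Big) && \text{by countable subadditivity} \\
&\ge \inf_{\substack{A_1, A_2, \ldots \in \mathcal{J} \\ A \subseteq \bigcup_i A_i}} Q(A) && \text{by monotonicity} \\
&= Q(A),
\end{aligned}$$

so that $P^*(A) \ge Q(A)$. Similarly, $P^*(A^C) \ge Q(A^C)$, and then (2.1.1) implies
$1 - P^*(A) \ge 1 - Q(A)$, i.e. $P^*(A) \le Q(A)$. Hence, $P^*(A) = Q(A)$. ∎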


Proposition 2.5.7 immediately implies the following: 


Proposition 2.5.8. Let $\mathcal{J}$ be a semialgebra of subsets of $\Omega$, and let
$\mathcal{F} = \sigma(\mathcal{J})$ be the generated σ-algebra. Let $P$ and $Q$ be two probability
distributions defined on $\mathcal{F}$. Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{J}$.
Then $P = Q$, i.e. $P(A) = Q(A)$ for all $A \in \mathcal{F}$.


Proof. Since $P$ and $Q$ are probability measures, they both satisfy (2.3.2)
and (2.3.3). Hence, by Proposition 2.5.7, each of $P$ and $Q$ is equal to the
$P^*$ of Theorem 2.3.1. ∎


One useful special case of Proposition 2.5.8 is: 


Corollary 2.5.9. Let $P$ and $Q$ be two probability distributions defined on
the collection $\mathcal{B}$ of Borel subsets of $\mathbf{R}$. Suppose $P((-\infty, x]) = Q((-\infty, x])$
for all $x \in \mathbf{R}$. Then $P(A) = Q(A)$ for all $A \in \mathcal{B}$.


Proof. Since $P((y, \infty)) = 1 - P((-\infty, y])$, and $P((x, y]) = 1 - P((-\infty, x]) - P((y, \infty))$,
and similarly for $Q$, it follows that $P$ and $Q$ agree on

$$\mathcal{J} = \{(-\infty, x] : x \in \mathbf{R}\} \cup \{(y, \infty) : y \in \mathbf{R}\} \cup \{(x, y] : x, y \in \mathbf{R}\} \cup \{\emptyset, \mathbf{R}\}. \qquad (2.5.10)$$

But $\mathcal{J}$ is a semialgebra (Exercise 2.7.10), and it follows from Exercise 2.4.5
that $\sigma(\mathcal{J}) = \mathcal{B}$. Hence, the result follows from Proposition 2.5.8. ∎


2.6. Coin tossing and other measures. 


Now that we have Theorem 2.3.1 to help us, we can easily construct 
other probability triples as well. 

For example, of frequent mention in probability theory is (independent,
fair) coin tossing. To model the flipping of $n$ coins, we can simply take
$\Omega = \{(r_1, r_2, \ldots, r_n);\ r_i = 0 \text{ or } 1\}$ (where 0 stands for tails and 1 stands
for heads), let $\mathcal{F} = 2^\Omega$ be the collection of all subsets of $\Omega$, and define $P$
by $P(A) = |A|/2^n$ for $A \in \mathcal{F}$. This is another example of a discrete probability
space; and we know from Theorem 2.2.1 that these spaces present no
difficulties.

But suppose now that we wish to model the flipping of a (countably)
infinite number of coins. In this case we can let


$$\Omega = \{(r_1, r_2, r_3, \ldots);\ r_i = 0 \text{ or } 1\}$$


be the collection of all binary sequences. But what about $\mathcal{F}$ and $P$?
Well, for each $n \in \mathbf{N}$ and each $a_1, a_2, \ldots, a_n \in \{0,1\}$, let us define
subsets $A_{a_1 a_2 \ldots a_n} \subseteq \Omega$ by


$$A_{a_1 a_2 \ldots a_n} = \{(r_1, r_2, \ldots) \in \Omega;\ r_i = a_i \text{ for } 1 \le i \le n\}.$$


(Thus, $A_0$ is the event that the first coin comes up tails; $A_{11}$ is the event
that the first two coins both come up heads; and $A_{101}$ is the event that
the first and third coins are heads while the second coin is tails.) Then we
clearly want $P(A_{a_1 a_2 \ldots a_n}) = 1/2^n$ for each set $A_{a_1 a_2 \ldots a_n}$. Hence, if we set


$$\mathcal{J} = \{A_{a_1 a_2 \ldots a_n};\ n \in \mathbf{N},\ a_1, a_2, \ldots, a_n \in \{0,1\}\} \cup \{\emptyset, \Omega\},$$


then we already know how to define $P(A)$ for each $A \in \mathcal{J}$. To apply the
Extension Theorem (in this case, Corollary 2.5.4), we need to verify that
certain conditions are satisfied.


Exercise 2.6.1. (a) Verify that the above $\mathcal{J}$ is a semialgebra.

(b) Verify that the above $\mathcal{J}$ and $P$ satisfy (2.5.5) for finite collections $\{D_n\}$.
[Hint: For a finite collection $\{D_n\} \subseteq \mathcal{J}$, there is $k \in \mathbf{N}$ such that the results
of only coins 1 through $k$ are specified by any $D_n$. Partition $\Omega$ into the
corresponding $2^k$ subsets.]


Verifying (2.5.5) for countable collections unfortunately requires a bit of 
topology; the proof of this next lemma may be skipped. 


Lemma 2.6.2. The above J and P (for infinite coin tossing) sat- 
isfy (2.5.5). 


Proof (optional). In light of Exercises 2.6.1 and 2.5.6, it suffices
to show that for $A_1, A_2, \ldots \in \mathcal{J}$ with $A_{n+1} \subseteq A_n$ and $\bigcap_{n=1}^\infty A_n = \emptyset$,
$\lim_{n \to \infty} P(A_n) = 0$.

Give $\{0,1\}$ the discrete topology, and give $\Omega = \{0,1\} \times \{0,1\} \times \ldots$
the corresponding product topology. Then $\Omega$ is a product of compact sets
$\{0,1\}$, and hence is itself compact by Tychonov’s Theorem. Furthermore
each element of $\mathcal{J}$ is a closed subset of $\Omega$, since its complement is open in
the product topology. Hence, each $A_n$ is a closed subset of a compact space,
and is therefore compact.

The finite intersection property of compact sets then implies that there
is $N \in \mathbf{N}$ with $A_n = \emptyset$ for all $n \ge N$. In particular, $P(A_n) \to 0$. ∎


Now that these conditions have been verified, it then follows from Corollary 2.5.4
that the probabilities for the special sets $A_{a_1 a_2 \ldots a_n} \in \mathcal{J}$ can automatically
be extended to a σ-algebra $\mathcal{M}$ containing $\mathcal{J}$. (Once again, this
σ-algebra will be quite complicated, and we will never understand it completely.
But it is still essential mathematically that we know it exists.) This
will be our probability triple for infinite fair coin tossing.

As a sample calculation, let $H_n = \{(r_1, r_2, \ldots) \in \Omega;\ r_n = 1\}$ be the event
that the $n$th coin comes up heads. We certainly would hope that $H_n \in \mathcal{M}$,
with $P(H_n) = \frac{1}{2}$. Happily, this is indeed the case. We note that

$$H_n = \bigcup_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} A_{r_1 r_2 \ldots r_{n-1} 1},$$

the union being disjoint. Hence, since $\mathcal{M}$ is closed under countable (including
finite) unions, we have $H_n \in \mathcal{M}$. Then, by countable additivity,

$$P(H_n) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} P\big(A_{r_1 r_2 \ldots r_{n-1} 1}\big) = \sum_{r_1, r_2, \ldots, r_{n-1} \in \{0,1\}} 1/2^n = 2^{n-1}/2^n = 1/2.$$

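
The sample calculation above can be mirrored directly in code by enumerating the $2^{n-1}$ disjoint cylinder sets that make up $H_n$. This is a small sketch only; representing a cylinder set $A_{a_1 \ldots a_n}$ by the tuple of its fixed digits is a convention adopted for this illustration.

```python
from fractions import Fraction
from itertools import product


def cylinder_probability(prefix):
    """P(A_{a_1...a_n}) = 1/2^n for a cylinder fixing the first n tosses."""
    return Fraction(1, 2 ** len(prefix))


def prob_nth_coin_heads(n):
    """P(H_n), summed over the disjoint cylinders whose n-th digit is 1."""
    return sum(
        cylinder_probability(head + (1,))
        for head in product((0, 1), repeat=n - 1)
    )
```

Each of the $2^{n-1}$ terms contributes $1/2^n$, so the result is 1/2 for every $n$.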
Remark. In fact, if we identify an element $x \in [0,1]$ by its binary
expansion $(r_1, r_2, \ldots)$, i.e. so that $x = \sum_{k=1}^\infty r_k/2^k$, then we see that in fact
infinite fair coin tossing may be viewed as being “essentially” the same thing
as Lebesgue measure on [0, 1].
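
To make this identification concrete, the following sketch computes the interval of [0,1] onto which a cylinder set $A_{a_1 \ldots a_n}$ is mapped by the binary expansion. The function name and the tuple representation of a cylinder are assumptions made for this illustration.

```python
from fractions import Fraction


def cylinder_to_interval(prefix):
    """End-points of the image of the cylinder A_{a_1...a_n} under
    x = sum_k r_k / 2^k: all x whose first n binary digits are a_1, ..., a_n."""
    left = sum(
        (Fraction(a, 2 ** k) for k, a in enumerate(prefix, start=1)),
        Fraction(0),
    )
    return left, left + Fraction(1, 2 ** len(prefix))
```

The image is an interval of length $1/2^n$, which is exactly the probability $1/2^n$ assigned to the cylinder set; this is the sense in which the two measures agree on these basic sets.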


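Next, given any two probability triples $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$, we
can define their product measure $P$ on the Cartesian product set $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2) : \omega_i \in \Omega_i\ (i = 1,2)\}$. We set


$$\mathcal{J} = \{A \times B;\ A \in \mathcal{F}_1,\ B \in \mathcal{F}_2\}, \qquad (2.6.3)$$


and define $P(A \times B) = P_1(A)\, P_2(B)$ for $A \times B \in \mathcal{J}$. (The elements of $\mathcal{J}$
are called measurable rectangles.)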


Exercise 2.6.4. Verify that the above $\mathcal{J}$ is a semialgebra, and that
$\emptyset, \Omega \in \mathcal{J}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$.


We will show later (Exercise 4.5.15) that these $\mathcal{J}$ and $P$ satisfy (2.5.5).
Hence, by Corollary 2.5.4, we can extend $P$ to a σ-algebra containing $\mathcal{J}$.
The resulting probability triple is called the product measure of $(\Omega_1, \mathcal{F}_1, P_1)$
and $(\Omega_2, \mathcal{F}_2, P_2)$.

An important special case of product measure is Lebesgue measure in
higher dimensions. For example, in dimension two, we define 2-dimensional
Lebesgue measure on $[0,1] \times [0,1]$ to be the product of Lebesgue measure
on [0,1] with itself. This is a probability measure on $\Omega = [0,1] \times [0,1]$ with
the property that


$$P([a, b] \times [c, d]) = (b - a)(d - c), \qquad 0 \le a \le b \le 1,\ \ 0 \le c \le d \le 1.$$


It is thus a measure of area in two dimensions.

More generally, we can inductively define $d$-dimensional Lebesgue measure
on $[0,1]^d$ for any $d \in \mathbf{N}$, by taking the product of Lebesgue measure on
[0, 1] with $(d-1)$-dimensional Lebesgue measure on $[0,1]^{d-1}$. When $d = 3$,
Lebesgue measure on $[0,1] \times [0,1] \times [0,1]$ is a measure of volume.


2.7. Exercises. 


Exercise 2.7.1. Let $\Omega = \{1,2,3,4\}$. Determine whether or not each of
the following is a σ-algebra.


(a) $\mathcal{F}_1 = \{\emptyset, \{1,2\}, \{3,4\}, \{1,2,3,4\}\}$.
(b) $\mathcal{F}_2 = \{\emptyset, \{3\}, \{4\}, \{1,2\}, \{3,4\}, \{1,2,3\}, \{1,2,4\}, \{1,2,3,4\}\}$.
(c) $\mathcal{F}_3 = \{\emptyset, \{1,2\}, \{1,3\}, \{1,4\}, \{2,3\}, \{2,4\}, \{3,4\}, \{1,2,3,4\}\}$.


Exercise 2.7.2. Let $\Omega = \{1,2,3,4\}$, and let $\mathcal{J} = \{\{1\}, \{2\}\}$. Describe
explicitly the σ-algebra $\sigma(\mathcal{J})$ generated by $\mathcal{J}$.


Exercise 2.7.3. Suppose $\mathcal{F}$ is a collection of subsets of $\Omega$, such that
$\Omega \in \mathcal{F}$.

(a) Suppose that whenever $A, B \in \mathcal{F}$, then also $A \setminus B = A \cap B^C \in \mathcal{F}$.
Prove that $\mathcal{F}$ is an algebra.

(b) Suppose $\mathcal{F}$ is a semialgebra. Prove that $\mathcal{F}$ is an algebra.

(c) Suppose that $\mathcal{F}$ is closed under complement, and also closed under
finite disjoint unions (i.e. whenever $A, B \in \mathcal{F}$ are disjoint, then $A \cup B \in \mathcal{F}$).
Give a counter-example to show that $\mathcal{F}$ might not be an algebra.


Exercise 2.7.4. Let $\mathcal{F}_1, \mathcal{F}_2, \ldots$ be a sequence of collections of subsets of
$\Omega$, such that $\mathcal{F}_n \subseteq \mathcal{F}_{n+1}$ for each $n$.

(a) Suppose that each $\mathcal{F}_i$ is an algebra. Prove that $\bigcup_{i=1}^\infty \mathcal{F}_i$ is also an
algebra.

(b) Suppose that each $\mathcal{F}_i$ is a σ-algebra. Show (by counter-example) that
$\bigcup_{i=1}^\infty \mathcal{F}_i$ might not be a σ-algebra.


Exercise 2.7.5. Suppose that $\Omega = \mathbf{N}$ is the set of positive integers, and
$\mathcal{F}$ is the set of all subsets $A$ such that either $A$ or $A^C$ is finite, and $P$ is
defined by $P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^C$ is finite.

(a) Is $\mathcal{F}$ an algebra?

(b) Is $\mathcal{F}$ a σ-algebra?

(c) Is $P$ finitely additive?

(d) Is $P$ countably additive on $\mathcal{F}$, meaning that if $A_1, A_2, \ldots \in \mathcal{F}$ are
disjoint, and if it happens that $\bigcup_n A_n \in \mathcal{F}$, then $P\big(\bigcup_n A_n\big) = \sum_n P(A_n)$?


Exercise 2.7.6. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is the
set of all subsets $A$ such that either $A$ or $A^C$ is finite, and $P$ is defined by
$P(A) = 0$ if $A$ is finite, and $P(A) = 1$ if $A^C$ is finite.

(a) Is $\mathcal{F}$ an algebra?

(b) Is $\mathcal{F}$ a σ-algebra?

(c) Is $P$ finitely additive?

(d) Is $P$ countably additive on $\mathcal{F}$ (as in the previous exercise)?


Exercise 2.7.7. Suppose that $\Omega = [0,1]$ is the unit interval, and $\mathcal{F}$ is
the set of all subsets $A$ such that either $A$ or $A^C$ is countable (i.e., finite
or countably infinite), and $P$ is defined by $P(A) = 0$ if $A$ is countable, and
$P(A) = 1$ if $A^C$ is countable.

(a) Is $\mathcal{F}$ an algebra?

(b) Is $\mathcal{F}$ a σ-algebra?

(c) Is $P$ finitely additive?

(d) Is $P$ countably additive on $\mathcal{F}$?


Exercise 2.7.8. For the example of Exercise 2.7.7, is P uncountably 
additive (cf. page 2)? 


Exercise 2.7.9. Let $\mathcal{F}$ be a σ-algebra, and write $|\mathcal{F}|$ for the total number
of subsets in $\mathcal{F}$. Prove that if $|\mathcal{F}| < \infty$ (i.e., if $\mathcal{F}$ consists of just a finite
number of subsets), then $|\mathcal{F}| = 2^m$ for some $m \in \mathbf{N}$. [Hint: Consider those
non-empty subsets in $\mathcal{F}$ which do not contain any other non-empty subset in
$\mathcal{F}$. How can all subsets in $\mathcal{F}$ be “built up” from these particular subsets?]


Exercise 2.7.10. Prove that the collection $\mathcal{J}$ of (2.5.10) is a semialgebra.


Exercise 2.7.11. Let $\Omega = [0,1]$. Let $\mathcal{J}'$ be the set of all half-open
intervals of the form $(a, b]$, for $0 \le a < b \le 1$, together with the sets $\emptyset$, $\Omega$,
and $\{0\}$.

(a) Prove that $\mathcal{J}'$ is a semialgebra.

(b) Prove that $\sigma(\mathcal{J}') = \mathcal{B}$, i.e. that the σ-algebra generated by this $\mathcal{J}'$ is
equal to the σ-algebra generated by the $\mathcal{J}$ of (2.4.1).

(c) Let $\mathcal{B}_0'$ be the collection of all finite disjoint unions of elements of $\mathcal{J}'$.
Prove that $\mathcal{B}_0'$ is an algebra. Is $\mathcal{B}_0'$ the same as the algebra $\mathcal{B}_0$ defined
in (2.2.4)?

[Remark: Some treatments of Lebesgue measure use $\mathcal{J}'$ instead of $\mathcal{J}$.]


Exercise 2.7.12. Let $K$ be the Cantor set as defined in Subsection 2.4.
Let $D_n = K \oplus \frac{1}{n}$, where $K \oplus \frac{1}{n}$ is defined as in (1.2.4). Let $B = \bigcup_{n=1}^\infty D_n$.

(a) Draw a rough sketch of $D_3$.

(b) What is $\lambda(D_3)$?

(c) Draw a rough sketch of $B$.

(d) What is $\lambda(B)$?


Exercise 2.7.13. Give an example of a sample space $\Omega$, a semialgebra
$\mathcal{J}$, and a non-negative function $P : \mathcal{J} \to \mathbf{R}$ with $P(\emptyset) = 0$ and $P(\Omega) = 1$,
such that (2.5.5) is not satisfied.


Exercise 2.7.14. Let $\Omega = \{1,2,3,4\}$, with $\mathcal{F}$ the collection of all subsets
of $\Omega$. Let $P$ and $Q$ be two probability measures on $\mathcal{F}$, such that $P\{1\} = P\{2\} = P\{3\} = P\{4\} = 1/4$,
and $Q\{2\} = Q\{4\} = 1/2$, extended to $\mathcal{F}$ by
linearity. Finally, let $\mathcal{J} = \{\emptyset, \Omega, \{1,2\}, \{2,3\}, \{3,4\}, \{1,4\}\}$.

(a) Prove that $P(A) = Q(A)$ for all $A \in \mathcal{J}$.

(b) Prove that there is $A \in \sigma(\mathcal{J})$ with $P(A) \ne Q(A)$.

(c) Why does this not contradict Proposition 2.5.8?


Exercise 2.7.15. Let $(\Omega, \mathcal{M}, \lambda)$ be Lebesgue measure on the interval
[0, 1]. Let

$$\Omega' = \{(x, y) \in \mathbf{R}^2;\ 0 < x < 1,\ 0 < y < 1\}.$$

Let $\mathcal{F}$ be the collection of all subsets of $\Omega'$ of the form

$$\{(x, y) \in \mathbf{R}^2;\ x \in A,\ 0 < y < 1\}$$

for some $A \in \mathcal{M}$. Finally, define a probability $P$ on $\mathcal{F}$ by

$$P\big(\{(x, y) \in \mathbf{R}^2;\ x \in A,\ 0 < y < 1\}\big) = \lambda(A).$$

(a) Prove that $(\Omega', \mathcal{F}, P)$ is a probability triple.

(b) Let $P^*$ be the outer measure corresponding to $P$ and $\mathcal{F}$. Define the
subset $S \subseteq \Omega'$ by

$$S = \{(x, y) \in \mathbf{R}^2;\ 0 < x < 1,\ y = 1/2\}.$$

(Note that $S \notin \mathcal{F}$.) Prove that $P^*(S) = 1$ and $P^*(S^C) = 1$.


Exercise 2.7.16. (a) Where in the proof of Theorem 2.3.1 was assumption
(2.3.3) used?

(b) How would the conclusion of Theorem 2.3.1 be modified if assumption
(2.3.3) were dropped (but all other assumptions remained the same)?


Exercise 2.7.17. Let $\Omega = \{1,2\}$, and let $\mathcal{J}$ be the collection of all subsets
of $\Omega$, with $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P\{1\} = P\{2\} = 1/3$.

(a) Verify that all assumptions of Theorem 2.3.1 other than (2.3.3) are 
satisfied. 

(b) Verify that assumption (2.3.3) is not satisfied. 

(c) Describe precisely the $\mathcal{M}$ and $P^*$ that would result in this example
from the modified version of Theorem 2.3.1 in Exercise 2.7.16(b).


Exercise 2.7.18. Let $\Omega = \{1,2\}$, $\mathcal{J} = \{\emptyset, \Omega, \{1\}\}$, $P(\emptyset) = 0$, $P(\Omega) = 1$,
and $P(\{1\}) = 1/3$.

(a) Can Theorem 2.3.1, Corollary 2.5.1, or Corollary 2.5.4 be applied in 
this case? Why or why not? 

(b) Can this P be extended to a valid probability measure? Explain. 


Exercise 2.7.19. Let $\Omega$ be a finite non-empty set, and let $\mathcal{J}$ consist
of all singletons in $\Omega$, together with $\emptyset$ and $\Omega$. Let $p : \Omega \to [0,1]$ with
$\sum_{\omega \in \Omega} p(\omega) = 1$, and define $P(\emptyset) = 0$, $P(\Omega) = 1$, and $P\{\omega\} = p(\omega)$ for all
$\omega \in \Omega$.

(a) Prove that $\mathcal{J}$ is a semialgebra.

(b) Prove that (2.3.2) and (2.3.3) are satisfied.

(c) Describe precisely the $\mathcal{M}$ and $P^*$ that result from applying Theorem 2.3.1.

(d) Are these $\mathcal{M}$ and $P^*$ the same as those described in Theorem 2.2.1?


Exercise 2.7.20. Let $P$ and $Q$ be two probability measures defined on
the same sample space $\Omega$ and σ-algebra $\mathcal{F}$.

(a) Suppose that $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) \le \frac{1}{2}$. Prove that
$P = Q$, i.e. that $P(A) = Q(A)$ for all $A \in \mathcal{F}$.

(b) Give an example where $P(A) = Q(A)$ for all $A \in \mathcal{F}$ with $P(A) < \frac{1}{2}$,
but such that $P \ne Q$, i.e. that $P(A) \ne Q(A)$ for some $A \in \mathcal{F}$.


Exercise 2.7.21. Let $\lambda$ be Lebesgue measure in dimension two, i.e.
Lebesgue measure on $[0,1] \times [0,1]$. Let $A$ be the triangle $\{(x, y) \in [0,1] \times [0,1];\ y < x\}$.
Prove that $A$ is measurable with respect to $\lambda$, and compute
$\lambda(A)$.

Exercise 2.7.22. Let $(\Omega_1, \mathcal{F}_1, P_1)$ be Lebesgue measure on [0,1]. Consider
a second probability triple, $(\Omega_2, \mathcal{F}_2, P_2)$, defined as follows: $\Omega_2 =$
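$\{1,2\}$, $\mathcal{F}_2$ consists of all subsets of $\Omega_2$, and $P_2$ is defined by $P_2\{1\} = \frac{1}{3}$,
$P_2\{2\} = \frac{2}{3}$, and additivity. Let $(\Omega, \mathcal{F}, P)$ be the product measure of
$(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$.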

(a) Express each of $\Omega$, $\mathcal{F}$, and $P$ as explicitly as possible.

(b) Find a set $A \in \mathcal{F}$ such that $P(A) = \frac{3}{4}$.


2.8. Section summary. 


The section gave a formal definition of a probability triple $(\Omega, \mathcal{F}, P)$,
consisting of a sample space $\Omega$, a σ-algebra $\mathcal{F}$, and a probability measure
$P$, and derived certain basic properties of them. It then considered the
question of how to construct such probability triples. Discrete spaces (with
countable $\Omega$) were straightforward, but other spaces were more challenging.
The key tool was the Extension Theorem, which said that once a probability
measure has been constructed on a semialgebra, it can then automatically
be extended to a σ-algebra.

The Extension Theorem allowed us to construct Lebesgue measure on
[0,1], and to consider some of its basic properties. It also allowed us to
construct other probability triples such as infinite coin tossing, product
measures, and multi-dimensional Lebesgue measure.


3. Further probabilistic foundations.


Now that we understand probability triples well, we discuss some ad- 
ditional essential ingredients of probability theory. Throughout Section 3 
(and, indeed, throughout most of this text and most of probability theory 
in general), we shall assume that there is an underlying probability triple 
(Q,F,P) with respect to which all further probability objects are defined. 
This assumption shall be so universal that we will often not even mention 
it. 


3.1. Random variables. 


If we think of a sample space $\Omega$ as the set of all possible random outcomes
of some experiment, then a random variable assigns a numerical value to
each of these outcomes. More formally, we have


Definition 3.1.1. Given a probability triple $(\Omega, \mathcal{F}, P)$, a random variable
is a function $X$ from $\Omega$ to the real numbers $\mathbf{R}$, such that


$$\{\omega \in \Omega;\ X(\omega) \le x\} \in \mathcal{F}, \qquad x \in \mathbf{R}. \qquad (3.1.2)$$


Equation (3.1.2) is a technical requirement, and states that the function
$X$ must be measurable. It can also be written as $\{X \le x\} \in \mathcal{F}$, or
$X^{-1}((-\infty, x]) \in \mathcal{F}$, for all $x \in \mathbf{R}$. Since complements and unions and
intersections are preserved under inverse images (see Subsection A.1), it
follows from Exercise 2.4.5 that equation (3.1.2) is equivalent to saying that
$X^{-1}(B) \in \mathcal{F}$ for every Borel set $B$. That is, the set $X^{-1}(B)$, also written
$\{X \in B\}$, is indeed an event. So, for any Borel set $B$, it makes sense to
talk about $P(X \in B)$, the probability that $X$ lies in $B$.
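
On a finite sample space, condition (3.1.2) can be checked exhaustively, which may help make the definition concrete. The sketch below uses a hypothetical four-point space with the small σ-algebra of Exercise 2.7.1(a); the function name is an assumption of this illustration. Since the set $\{X \le x\}$ can only change when $x$ crosses a value taken by $X$, and is empty below the smallest value, it suffices to test $x$ at the values of $X$.

```python
OMEGA = frozenset({1, 2, 3, 4})

# Hypothetical sigma-algebra on OMEGA (the collection F_1 of Exercise 2.7.1(a)).
F = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), OMEGA}


def is_random_variable(X):
    """Check (3.1.2) for X on the finite space OMEGA with sigma-algebra F."""
    values = {X(w) for w in OMEGA}
    return all(
        frozenset(w for w in OMEGA if X(w) <= x) in F
        for x in values
    )
```

For instance, the indicator function of {1, 2} passes the check, whereas the identity function $X(\omega) = \omega$ fails, because $\{X \le 1\} = \{1\}$ is not in this σ-algebra.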


Example 3.1.3. Suppose that $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on [0, 1].
Then we might define some random variables $X$, $Y$, and $Z$ by $X(\omega) = \omega$,
$Y(\omega) = 2\omega$, and $Z(\omega) = 3\omega + 4$. We then have, for example, that $Y = 2X$,
and $Z = 3X + 4 = \frac{3}{2}Y + 4$. Also, $P(Y \le 1/3) = P\{\omega;\ Y(\omega) \le 1/3\} = P\{\omega;\ 2\omega \le 1/3\} = P([0, 1/6]) = 1/6$.
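
Probabilities of this kind reduce to computing the length of an interval. As a small sketch (the function name and its parameters are hypothetical, and the slope is assumed positive), the probability that an increasing affine function of $\omega$ is at most $t$ can be computed exactly:

```python
from fractions import Fraction


def prob_affine_at_most(a, b, t):
    """P(a*w + b <= t) for w uniform on [0, 1], assuming a > 0."""
    cutoff = (Fraction(t) - b) / a  # a*w + b <= t  iff  w <= cutoff
    # The event is [0, cutoff] intersected with [0, 1]; return its length.
    return min(max(cutoff, Fraction(0)), Fraction(1))
```

With `a = 2`, `b = 0`, `t = 1/3` this reproduces the value 1/6 computed above for $Y$.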


Exercise 3.1.4. For Example 3.1.3, compute $P(Z > a)$ and $P(X < a \text{ and } Y < b)$
as functions of $a, b \in \mathbf{R}$.


Now, not all functions from $\Omega$ to $\mathbf{R}$ are random variables. For example,
let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on [0,1], and let $H \subseteq \Omega$ be the non-measurable
set of Proposition 2.4.8. Define $X : \Omega \to \mathbf{R}$ by $X = \mathbf{1}_{H^C}$, so
$X(\omega) = 0$ for $\omega \in H$, and $X(\omega) = 1$ for $\omega \notin H$. Then $\{\omega \in \Omega : X(\omega) \le 1/2\} = H \notin \mathcal{F}$,
so $X$ is not a random variable.

On the other hand, the following proposition shows that condition (3.1.2)
is preserved under usual arithmetic and limits. In practice, this means that
if functions from $\Omega$ to $\mathbf{R}$ are constructed in “usual” ways, then (3.1.2) will
be satisfied, so the functions will indeed be random variables.


Proposition 3.1.5. (i) If $X = \mathbf{1}_A$ is the indicator of some event $A \in \mathcal{F}$,
then $X$ is a random variable.

(ii) If $X$ and $Y$ are random variables and $c \in \mathbf{R}$, then $X + c$, $cX$, $X^2$, $X + Y$,
and $XY$ are all random variables.

(iii) If $Z_1, Z_2, \ldots$ are random variables such that $\lim_{n \to \infty} Z_n(\omega)$ exists for
each $\omega \in \Omega$, and $Z(\omega) = \lim_{n \to \infty} Z_n(\omega)$, then $Z$ is also a random variable.


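Proof. (i) If $X = \mathbf{1}_A$ for $A \in \mathcal{F}$, then $X^{-1}(B)$ must be one of $A$, $A^C$, $\emptyset$,
or $\Omega$, so $X^{-1}(B) \in \mathcal{F}$.

(ii) The first two of these assertions are immediate. The third follows since
for $y \ge 0$, $\{X^2 \le y\} = \{X \in [-\sqrt{y}, \sqrt{y}]\} \in \mathcal{F}$. For the fourth, note (by
finding a rational number $r \in (X, x - Y)$) that

$$\{X + Y < x\} = \bigcup_{r \text{ rational}} \big(\{X < r\} \cap \{Y < x - r\}\big) \in \mathcal{F}.$$

The fifth assertion then follows since $XY = \frac{1}{2}\big[(X + Y)^2 - X^2 - Y^2\big]$.

(iii) For $x \in \mathbf{R}$,

$$\{Z \le x\} = \bigcap_{m=1}^\infty \bigcup_{N=1}^\infty \bigcap_{n=N}^\infty \big\{Z_n \le x + \tfrac{1}{m}\big\}. \qquad (3.1.6)$$

But $Z_n$ is a random variable, so $\{Z_n \le x + \frac{1}{m}\} \in \mathcal{F}$. Then, since $\mathcal{F}$ is a
σ-algebra, we must have $\{Z \le x\} \in \mathcal{F}$. ∎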


Exercise 3.1.7. Prove (3.1.6). [Hint: remember the definition of $X(\omega) = \lim_{n \to \infty} X_n(\omega)$,
cf. Subsection A.3.]


Suppose now that $X$ is a random variable, and $f : \mathbf{R} \to \mathbf{R}$ is a function
from $\mathbf{R}$ to $\mathbf{R}$ which is Borel-measurable, meaning that $f^{-1}(A) \in \mathcal{B}$ for any
$A \in \mathcal{B}$ (where $\mathcal{B}$ is the collection of Borel sets of $\mathbf{R}$). (Equivalently, $f$ is a
random variable corresponding to $\Omega = \mathbf{R}$ and $\mathcal{F} = \mathcal{B}$.) We can define a new
random variable $f(X)$, the composition of $X$ with $f$, by $f(X)(\omega) = f(X(\omega))$
for each $\omega \in \Omega$. Then (3.1.2) is satisfied since for $B \in \mathcal{B}$, $\{f(X) \in B\} = \{X \in f^{-1}(B)\} \in \mathcal{F}$.

Proposition 3.1.8. If $f$ is a continuous function, or a piecewise-continuous
function, then $f$ is Borel-measurable.


Proof. A basic result of point-set topology says that if $f$ is continuous,
then $f^{-1}(O)$ is an open subset of $\mathbf{R}$ whenever $O$ is. In particular,
$f^{-1}((x, \infty))$ is open, so $f^{-1}((x, \infty)) \in \mathcal{B}$, so $f^{-1}((-\infty, x]) \in \mathcal{B}$.

If $f$ is piecewise-continuous, then we can write $f = f_1 \mathbf{1}_{I_1} + f_2 \mathbf{1}_{I_2} + \ldots +
f_n \mathbf{1}_{I_n}$, where the $f_j$ are continuous and the $\{I_j\}$ are disjoint intervals. It
follows from the above and Proposition 3.1.5 that $f$ is Borel-measurable. ∎


For example, if $f(x) = x^k$ for $k \in \mathbf{N}$, then $f$ is Borel-measurable. Hence, if
$X$ is a random variable, then so is $X^k$ for all $k \in \mathbf{N}$.


Remark 3.1.9. In probability theory, the underlying probability triple
$(\Omega, \mathcal{F}, P)$ is usually complete (cf. Exercise 2.3.16; for example this is always
true for discrete probability spaces, or for those such as Lebesgue measure
constructed using the Extension Theorem). In that case, if $X$ is a random
variable, and $Y : \Omega \to \mathbf{R}$ is such that $P(X = Y) = 1$, then $Y$ must also be a
random variable.


Remark 3.1.10. In Definition 3.1.1, we assume that $X$ is a real-valued
random variable, i.e. that it maps $\Omega$ into the set of real numbers equipped
with the Borel $\sigma$-algebra. More generally, one could consider a random
variable which mapped $\Omega$ to an arbitrary second measurable space, i.e. to
some second non-empty set $\Omega'$ with its own collection $\mathcal{F}'$ of measurable
subsets. We would then have $X : \Omega \to \Omega'$, with condition (3.1.2) replaced
by the condition that $X^{-1}(A') \in \mathcal{F}$ whenever $A' \in \mathcal{F}'$.


3.2. Independence. 


Informally, events or random variables are independent if they do not
affect each other’s probabilities. Thus, two events $A$ and $B$ are independent
if $P(A \cap B) = P(A)P(B)$. Intuitively, the probabilistic proportion of the
event $B$ which also includes $A$ (i.e., $P(A \cap B)/P(B)$) is equal to the overall
probability of $A$ (i.e., $P(A)$); the definition uses products to avoid division
by zero.

Three events $A$, $B$, and $C$ are said to be independent if all of the following
equations are satisfied: $P(A \cap B) = P(A)P(B)$; $P(A \cap C) = P(A)P(C)$;
$P(B \cap C) = P(B)P(C)$; and $P(A \cap B \cap C) = P(A)P(B)P(C)$. It is not
sufficient (see Exercise 3.6.3) to check just the final equation, or just the
first three. More generally, a possibly-infinite collection $\{A_\alpha\}_{\alpha \in I}$
of events is said to be independent if for each $j \in \mathbf{N}$ and each distinct finite
choice $\alpha_1, \alpha_2, \ldots, \alpha_j \in I$, we have

$$P(A_{\alpha_1} \cap A_{\alpha_2} \cap \ldots \cap A_{\alpha_j}) = P(A_{\alpha_1}) P(A_{\alpha_2}) \cdots P(A_{\alpha_j}). \tag{3.2.1}$$
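
On a finite sample space, the product rule for every finite subcollection can be checked by brute force. The sketch below is illustrative only: the sample space of three fair coin tosses, the helper names `prob` and `is_independent`, and the particular events are all assumptions chosen for the example, not taken from the text.

```python
from fractions import Fraction
from itertools import combinations, product

# Hypothetical sample space: three fair coin tosses, all 8 outcomes equally likely.
OMEGA = list(product([0, 1], repeat=3))

def prob(event):
    """P(event) under the uniform measure on OMEGA."""
    return Fraction(len(event), len(OMEGA))

def is_independent(events):
    """Check the product rule for every subcollection of size >= 2."""
    for j in range(2, len(events) + 1):
        for sub in combinations(events, j):
            inter = set(OMEGA).intersection(*sub)
            rhs = Fraction(1)
            for e in sub:
                rhs *= prob(e)
            if prob(inter) != rhs:
                return False
    return True

H1 = {w for w in OMEGA if w[0] == 1}       # first coin heads
H2 = {w for w in OMEGA if w[1] == 1}       # second coin heads
H3 = {w for w in OMEGA if w[2] == 1}       # third coin heads
SAME = {w for w in OMEGA if w[0] == w[1]}  # first two coins agree

print(is_independent([H1, H2, H3]))    # True
print(is_independent([H1, H2, SAME]))  # False
```

In the second collection every pair satisfies the product rule, but the triple intersection has probability 1/4 rather than 1/8, which illustrates why checking pairs alone is not sufficient.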


Exercise 3.2.2. Suppose (3.2.1) is satisfied.

(a) Show that (3.2.1) is still satisfied if $A_{\alpha_1}$ is replaced by $A_{\alpha_1}^C$.

(b) Show that (3.2.1) is still satisfied if each $A_{\alpha_i}$ is replaced by the
corresponding $A_{\alpha_i}^C$.

(c) Prove that if $\{A_\alpha\}_{\alpha \in I}$ is independent, then so is $\{A_\alpha^C\}_{\alpha \in I}$.


We shall on occasion also talk about independence of collections of
events. Collections of events $\{\mathcal{A}_\alpha;\ \alpha \in I\}$ are independent if for all $j \in \mathbf{N}$,
for all distinct $\alpha_1, \ldots, \alpha_j \in I$, and for all $A_1 \in \mathcal{A}_{\alpha_1}, \ldots, A_j \in \mathcal{A}_{\alpha_j}$, equation
(3.2.1) holds.

We shall also talk about independence of random variables. Random
variables $X$ and $Y$ are independent if for all Borel sets $S_1$ and $S_2$, the
events $X^{-1}(S_1)$ and $Y^{-1}(S_2)$ are independent, i.e. $P(X \in S_1, Y \in S_2) =
P(X \in S_1)\, P(Y \in S_2)$. More generally, a collection $\{X_\alpha;\ \alpha \in I\}$ of random
variables is independent if for all $j \in \mathbf{N}$, for all distinct $\alpha_1, \ldots, \alpha_j \in I$,
and for all Borel sets $S_1, \ldots, S_j$, we have

$$P(X_{\alpha_1} \in S_1, X_{\alpha_2} \in S_2, \ldots, X_{\alpha_j} \in S_j) = P(X_{\alpha_1} \in S_1)\, P(X_{\alpha_2} \in S_2) \cdots P(X_{\alpha_j} \in S_j).$$

Independence is preserved under deterministic transformations:


Proposition 3.2.3. Let $X$ and $Y$ be independent random variables. Let
$f, g : \mathbf{R} \to \mathbf{R}$ be Borel-measurable functions. Then the random variables
$f(X)$ and $g(Y)$ are independent.


Proof. For Borel $S_1, S_2 \subseteq \mathbf{R}$, we compute that

$$P(f(X) \in S_1,\ g(Y) \in S_2) = P(X \in f^{-1}(S_1),\ Y \in g^{-1}(S_2))$$
$$= P(X \in f^{-1}(S_1))\, P(Y \in g^{-1}(S_2))$$
$$= P(f(X) \in S_1)\, P(g(Y) \in S_2). \quad ∎$$


We also have the following. 


Proposition 3.2.4. Let $X$ and $Y$ be two random variables, defined jointly
on some probability triple $(\Omega, \mathcal{F}, P)$. Then $X$ and $Y$ are independent if and
only if $P(X \le x, Y \le y) = P(X \le x)\, P(Y \le y)$ for all $x, y \in \mathbf{R}$.


Proof. The “only if” part is immediate from the definition.

For the “if” part, fix $x \in \mathbf{R}$ with $P(X \le x) > 0$, and define the measure
$Q$ on the Borel subsets of $\mathbf{R}$ by $Q(S) = P(X \le x, Y \in S)/P(X \le x)$.
Then by assumption, $Q((-\infty, y]) = P(Y \le y)$ for all $y \in \mathbf{R}$. It follows
from Corollary 2.5.9 that $Q(S) = P(Y \in S)$ for all Borel $S \subseteq \mathbf{R}$, i.e. that
$P(X \le x, Y \in S) = P(X \le x)\, P(Y \in S)$.

Then, for fixed Borel $S \subseteq \mathbf{R}$ with $P(Y \in S) > 0$, let $R(T) = P(X \in T, Y \in S)/P(Y \in S)$.
By the above, it follows that $R((-\infty, x]) = P(X \le x)$ for each $x \in \mathbf{R}$.
It then follows from Corollary 2.5.9 that $R(T) = P(X \in T)$ for all Borel
$T \subseteq \mathbf{R}$, i.e. that $P(X \in T, Y \in S) = P(X \in T)\, P(Y \in S)$ for all Borel
$S, T \subseteq \mathbf{R}$. Hence, $X$ and $Y$ are independent. ∎


Independence will come up often in this text, and its significance will 
become more clear as we proceed. 


3.3. Continuity of probabilities. 


Given a probability triple $(\Omega, \mathcal{F}, P)$, and events $A, A_1, A_2, \ldots \in \mathcal{F}$, we
write $\{A_n\} \nearrow A$ to mean that $A_1 \subseteq A_2 \subseteq A_3 \subseteq \ldots$, and $\bigcup_n A_n = A$.
In words, the events $A_n$ increase to $A$. Similarly, we write $\{A_n\} \searrow A$ to
mean that $\{A_n^C\} \nearrow A^C$, or equivalently that $A_1 \supseteq A_2 \supseteq A_3 \supseteq \ldots$, and
$\bigcap_n A_n = A$. In words, the events $A_n$ decrease to $A$. We then have


Proposition 3.3.1. (Continuity of probabilities.) If $\{A_n\} \nearrow A$ or
$\{A_n\} \searrow A$, then $\lim_{n\to\infty} P(A_n) = P(A)$.


Proof. Suppose $\{A_n\} \nearrow A$. Let $B_n = A_n \cap A_{n-1}^C$ (with the convention
$A_0 = \emptyset$). Then the $\{B_n\}$ are disjoint, with $\bigcup B_n = \bigcup A_n = A$. Hence,

$$P(A) = P\Big(\bigcup_{m=1}^{\infty} B_m\Big) = \sum_{m=1}^{\infty} P(B_m)
= \lim_{n\to\infty} \sum_{m=1}^{n} P(B_m) = \lim_{n\to\infty} P\Big(\bigcup_{m \le n} B_m\Big) = \lim_{n\to\infty} P(A_n)$$

(where the last equality is the only time we use that the $\{A_m\}$ are a nested
sequence).

If instead $\{A_n\} \searrow A$, then $\{A_n^C\} \nearrow A^C$, so

$$P(A) = 1 - P(A^C) = 1 - \lim_n P(A_n^C) = \lim_n \big(1 - P(A_n^C)\big) = \lim_n P(A_n). \quad ∎$$

If the $\{A_n\}$ are not nested, then we may not have $\lim_n P(A_n) = P(A)$.
For example, suppose that $A_n = \Omega$ for $n$ odd, but $A_n = \emptyset$ for $n$ even. Then
$P(A_n)$ alternates between 0 and 1, so that $\lim_n P(A_n)$ does not exist.
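
A small numerical illustration of both situations follows. It is only a sketch under stated assumptions: the nested family $A_n = [0, 1 - 1/n]$ under Lebesgue measure on $[0,1]$ is a hypothetical choice (it increases to $[0,1)$, which has probability 1), and the function names are made up for the example.

```python
from fractions import Fraction

def p_increasing(n):
    """P(A_n) for A_n = [0, 1 - 1/n], which increase to A = [0, 1)."""
    return 1 - Fraction(1, n)

def p_alternating(n):
    """P(A_n) for A_n = Omega (n odd) or the empty set (n even)."""
    return Fraction(1) if n % 2 == 1 else Fraction(0)

nested = [p_increasing(n) for n in (1, 10, 100, 1000)]
print([str(p) for p in nested])  # ['0', '9/10', '99/100', '999/1000']
print([int(p_alternating(n)) for n in range(1, 7)])  # [1, 0, 1, 0, 1, 0]
```

The first list approaches 1, the probability of the limiting event, while the second oscillates and has no limit.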


3.4. Limit events. 


Given events $A_1, A_2, \ldots \in \mathcal{F}$, we define

$$\limsup_n A_n = \{A_n \ i.o.\} = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k$$

and

$$\liminf_n A_n = \{A_n \ a.a.\} = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k.$$

The event $\limsup_n A_n$ is referred to as “$A_n$ infinitely often”; it stands for
those $\omega \in \Omega$ which are in infinitely many of the $A_n$. Intuitively, it is the
event that infinitely many of the events $A_n$ occur. Similarly, the event
$\liminf_n A_n$ is referred to as “$A_n$ almost always”; intuitively, it is the event
that all but a finite number of the events $A_n$ occur.

Since $\mathcal{F}$ is a $\sigma$-algebra, we see that $\limsup_n A_n \in \mathcal{F}$ and $\liminf_n A_n \in \mathcal{F}$.
Also, by de Morgan’s laws, $(\limsup_n A_n)^C = \liminf_n (A_n^C)$, so $P(A_n \ i.o.) =
1 - P(A_n^C \ a.a.)$.

For example, suppose $(\Omega, \mathcal{F}, P)$ is infinite fair coin tossing, and $H_n$ is
the event that the $n^{\text{th}}$ coin is heads. Then $\limsup_n H_n$ is the event that
there are infinitely many heads. Also, $\liminf_n H_n$ is the event that all but
a finite number of the coins were heads, i.e. that there were only finitely
many tails.
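
For a sequence of sets that repeats periodically, the two limit events can be computed directly, since every tail of the sequence contains a full period. The following is a minimal sketch; the function name and the three-point example are hypothetical and chosen only for illustration.

```python
def limsup_liminf_periodic(period):
    """limsup and liminf of the sequence A_1, A_2, ... that repeats `period` forever.

    For a periodic sequence, every tail union equals the union over one period,
    and every tail intersection equals the intersection over one period.
    """
    limsup = set().union(*period)
    liminf = set(period[0]).intersection(*period[1:])
    return limsup, liminf

# Hypothetical example on Omega = {1, 2, 3}: A_n alternates between {1, 2} and {2, 3}.
sup_set, inf_set = limsup_liminf_periodic([{1, 2}, {2, 3}])
print(sup_set)  # {1, 2, 3}
print(inf_set)  # {2}
```

Here the points 1 and 3 each lie in infinitely many of the sets but not in all but finitely many, so they belong to the limsup and not to the liminf.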


Proposition 3.4.1. We always have

$$P\big(\liminf_n A_n\big) \le \liminf_n P(A_n) \le \limsup_n P(A_n) \le P\big(\limsup_n A_n\big).$$


Proof. The middle inequality holds by definition, and the last inequality
follows similarly to the first, so we prove only the first inequality. We
note that as $n \to \infty$, the events $\{\bigcap_{k=n}^{\infty} A_k\}$ are increasing (cf. page 33) to
$\liminf_n A_n$. Hence, by continuity of probabilities,

$$P\big(\liminf_n A_n\big) = P\Big(\bigcup_{n} \bigcap_{k=n}^{\infty} A_k\Big) = \lim_{n\to\infty} P\Big(\bigcap_{k=n}^{\infty} A_k\Big)
= \liminf_{n\to\infty} P\Big(\bigcap_{k=n}^{\infty} A_k\Big) \le \liminf_{n\to\infty} P(A_n),$$

where the final equality follows by definition (if a limit exists, then it is equal
to the liminf), and the final inequality follows from monotonicity (2.1.2). ∎


For example, again considering infinite fair coin tossing, with $H_n$ the
event that the $n^{\text{th}}$ coin is heads, Proposition 3.4.1 says that $P(H_n \ i.o.) \ge \frac{1}{2}$,
which is interesting but vague. To improve this result, we require a more
powerful theorem.


Theorem 3.4.2. (The Borel-Cantelli Lemma.) Let $A_1, A_2, \ldots \in \mathcal{F}$.
(i) If $\sum_n P(A_n) < \infty$, then $P(\limsup_n A_n) = 0$.
(ii) If $\sum_n P(A_n) = \infty$, and $\{A_n\}$ are independent, then $P(\limsup_n A_n) = 1$.


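Proof. For (i), we note that for any $m \in \mathbf{N}$, we have by countable
subadditivity that

$$P\big(\limsup_n A_n\big) \le P\Big(\bigcup_{k=m}^{\infty} A_k\Big) \le \sum_{k=m}^{\infty} P(A_k),$$

which goes to 0 as $m \to \infty$ if the sum is convergent.

For (ii), since $(\limsup_n A_n)^C = \bigcup_{n=1}^{\infty} \bigcap_{k=n}^{\infty} A_k^C$, it suffices (by countable
subadditivity) to show that $P\big(\bigcap_{k=n}^{\infty} A_k^C\big) = 0$ for each $n \in \mathbf{N}$. Well, for
$n, m \in \mathbf{N}$, we have by independence and Exercise 3.2.2 (and since $1 - x \le
e^{-x}$ for any real number $x$) that

$$P\Big(\bigcap_{k=n}^{\infty} A_k^C\Big) \le P\Big(\bigcap_{k=n}^{n+m} A_k^C\Big)
= \prod_{k=n}^{n+m} \big(1 - P(A_k)\big)
\le \prod_{k=n}^{n+m} e^{-P(A_k)}
= e^{-\sum_{k=n}^{n+m} P(A_k)},$$

which goes to 0 as $m \to \infty$ if the sum is divergent. ∎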
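
The product bound in part (ii) can be examined numerically. The sketch below assumes, purely for illustration, independent events with $P(A_k) = 1/k$ (so that the sum diverges); the function names are hypothetical. With this choice the product telescopes, which makes the decay to 0 explicit.

```python
from fractions import Fraction
from math import exp

def prob_none_occur(n, m):
    """prod_{k=n}^{m} (1 - 1/k): the chance that none of A_n, ..., A_m occurs,
    assuming (hypothetically) independent events with P(A_k) = 1/k."""
    result = Fraction(1)
    for k in range(n, m + 1):
        result *= 1 - Fraction(1, k)
    return result

def exp_bound(n, m):
    """The upper bound exp(-sum_{k=n}^{m} 1/k) used in the proof."""
    return exp(-sum(1.0 / k for k in range(n, m + 1)))

for m in (10, 100, 1000):
    print(m, prob_none_occur(2, m), exp_bound(2, m))
```

For this choice the exact product equals $(n-1)/m$, which tends to 0 as $m$ grows, and it stays below the exponential bound at every stage.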


This theorem is striking since it asserts that if $\{A_n\}$ are independent,
then $P(\limsup_n A_n)$ is always either 0 or 1; it is never $\frac{1}{2}$ or $\frac{1}{4}$ or any other
value. In the next section we shall see that this statement is true even more
generally.

We note that the independence assumption for part (ii) of Theorem 3.4.2
cannot simply be omitted. For example, consider infinite fair coin tossing,
and let $A_1 = A_2 = A_3 = \ldots = \{r_1 = 1\}$, i.e. let all the events be the
event that the first coin comes up heads. Then the $\{A_n\}$ are clearly not
independent. And, we clearly have $P(\limsup_n A_n) = P(r_1 = 1) = \frac{1}{2}$.

Theorem 3.4.2 provides very precise information about $P(\limsup_n A_n)$
in many cases. Consider again infinite fair coin tossing, with $H_n$ the event
that the $n^{\text{th}}$ coin is heads. This theorem shows that $P(H_n \ i.o.) = 1$, i.e.
there is probability 1 that an infinite sequence of coins will contain infinitely
many heads. Furthermore, $P(H_n \ a.a.) = 1 - P(H_n^C \ i.o.) = 1 - 1 = 0$, so
the infinite sequence will never contain all but finitely many heads.

Similarly, we have that

$$P\{H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]} \ i.o.\} = 1,$$

since the events $\{H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[\log_2 n]}\}$ are seen to be independent
for different values of $n$, and since their probabilities are approximately $1/n$
which sums to infinity. On the other hand,

$$P\{H_{2^n+1} \cap H_{2^n+2} \cap \ldots \cap H_{2^n+[2\log_2 n]} \ i.o.\} = 0,$$

since in this case the probabilities are approximately $1/n^2$ which have finite
sum.

An event like $P(B_n \ i.o.)$, where $B_n = \{H_n \cap H_{n+1}\}$, is more difficult.
In this case $\sum_n P(B_n) = \sum_n (1/4) = \infty$. However, the $\{B_n\}$ are not
independent, since $B_n$ and $B_{n+1}$ both involve the same event $H_{n+1}$ (i.e., the
$(n+1)^{\text{st}}$ coin). Hence, Theorem 3.4.2 does not immediately apply. On
the other hand, by considering the subsequence $n = 2k$ of indices, we see
that $\{B_{2k}\}_{k=1}^{\infty}$ are independent, and $\sum_{k=1}^{\infty} P(B_{2k}) = \sum_{k=1}^{\infty} (1/4) = \infty$.
Hence, $P(B_{2k} \ i.o.) = 1$, so that $P(B_n \ i.o.) = 1$ also.

For a similar but more complicated example, let $B_n = \{H_{n+1} \cap H_{n+2} \cap
\ldots \cap H_{n+[\log_2 \log_2 n]}\}$. Again, $\sum_n P(B_n) = \infty$, but the $\{B_n\}$ are not
independent. But by considering the subsequence $n = 2^k$ of indices, we compute
that $\{B_{2^k}\}$ are independent, and $\sum_k P(B_{2^k}) = \infty$. Hence, $P(B_{2^k} \ i.o.) = 1$,
so that $P(B_n \ i.o.) = 1$ also.


3.5. Tail fields. 


Given a sequence of events $A_1, A_2, \ldots$, we define their tail field by

$$\tau = \bigcap_{n=1}^{\infty} \sigma(A_n, A_{n+1}, A_{n+2}, \ldots).$$

In words, an event $A \in \tau$ must have the property that for any $n$, it depends
only on the events $A_n, A_{n+1}, \ldots$; in particular, it does not care about any
finite number of the events $A_n$.

One might think that very few events could possibly be in the tail field,
but in fact it sometimes contains many events. For example, if we are
considering infinite fair coin tossing (Subsection 2.6), and $H_n$ is the event
that the $n^{\text{th}}$ coin comes up heads, then $\tau$ includes the event $\limsup_n H_n$
that we obtain infinitely many heads; the event $\liminf_n H_n$ that we obtain
only finitely many tails; the event $\limsup_n H_{2^n}$ that we obtain infinitely
many heads on tosses 2, 4, 8, ...; the event $\{\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} r_i \le \frac{1}{4}\}$ that
the limiting fraction of heads is $\le \frac{1}{4}$; the event $\{r_n = r_{n+1} = r_{n+2} \ i.o.\}$
that we infinitely often obtain the same result on three consecutive coin
flips; etc. So we see that $\tau$ contains many interesting events.
A surprising theorem is 


Theorem 3.5.1. (Kolmogorov Zero-One Law.) If events $A_1, A_2, \ldots$ are
independent, with tail-field $\tau$, and if $A \in \tau$, then $P(A) = 0$ or 1.


To prove this theorem, we need a technical result about independence. 


Lemma 3.5.2. Let $B, B_1, B_2, \ldots$ be independent. Then $\{B\}$ and
$\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if $S \in \sigma(B_1, B_2, \ldots)$, then
$P(S \cap B) = P(S)P(B)$.


Proof. Assume that $P(B) > 0$, otherwise the statement is trivial.

Let $\mathcal{J}$ be the collection of all sets of the form $D_{i_1} \cap D_{i_2} \cap \ldots \cap D_{i_n}$, where
$n \in \mathbf{N}$ and where $D_{i_j}$ is either $B_{i_j}$ or $B_{i_j}^C$, together with $\emptyset$ and $\Omega$. Then for
$A \in \mathcal{J}$, we have by independence that $P(A) = P(B \cap A)/P(B)$.

Now define a new probability measure $Q$ on $\sigma(B_1, B_2, \ldots)$ by $Q(S) =
P(B \cap S)/P(B)$, for $S \in \sigma(B_1, B_2, \ldots)$. Then $Q(\emptyset) = 0$, $Q(\Omega) = 1$, and
$Q$ is countably additive since $P$ is, so $Q$ is indeed a probability measure.
Furthermore, $Q$ and $P$ agree on $\mathcal{J}$. Hence, by Proposition 2.5.8, $Q$ and $P$
agree on $\sigma(\mathcal{J}) = \sigma(B_1, B_2, \ldots)$. That is, $P(S) = Q(S) = P(B \cap S)/P(B)$
for all $S \in \sigma(B_1, B_2, \ldots)$, as required. ∎


Applying this lemma twice, we obtain: 


Corollary 3.5.3. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be independent. Then if
$S_1 \in \sigma(A_1, A_2, \ldots)$, then $S_1, B_1, B_2, \ldots$ are independent. Furthermore, the
$\sigma$-algebras $\sigma(A_1, A_2, \ldots)$ and $\sigma(B_1, B_2, \ldots)$ are independent classes, i.e. if
$S_1 \in \sigma(A_1, A_2, \ldots)$, and $S_2 \in \sigma(B_1, B_2, \ldots)$, then $P(S_1 \cap S_2) = P(S_1)\, P(S_2)$.


Proof. For any distinct $i_1, i_2, \ldots, i_n$, let $A = B_{i_1} \cap \ldots \cap B_{i_n}$. Then
it follows immediately that $A, A_1, A_2, \ldots$ are independent. Hence, from
Lemma 3.5.2, if $S_1 \in \sigma(A_1, A_2, \ldots)$, then $P(A \cap S_1) = P(A)\, P(S_1)$. Since
this is true for all distinct $i_1, \ldots, i_n$, it follows that $S_1, B_1, B_2, \ldots$ are
independent. Lemma 3.5.2 then implies that if $S_2 \in \sigma(B_1, B_2, \ldots)$, then $S_1$ and
$S_2$ are independent. ∎


Proof of Theorem 3.5.1. We can now easily prove the Kolmogorov
Zero-One Law. The proof is rather remarkable!

Indeed, $A \in \sigma(A_n, A_{n+1}, \ldots)$, so by Corollary 3.5.3, $A, A_1, A_2, \ldots, A_{n-1}$
are independent. Since this is true for all $n \in \mathbf{N}$, and since independence is
defined in terms of finite subcollections only, it is also true that $A, A_1, A_2, \ldots$
are independent. Hence, from Lemma 3.5.2, $A$ and $S$ are independent for
all $S \in \sigma(A_1, A_2, \ldots)$.

On the other hand, $A \in \tau \subseteq \sigma(A_1, A_2, \ldots)$. It follows that $A$ is
independent of itself (!). This implies that $P(A \cap A) = P(A)\, P(A)$. That is,
$P(A) = P(A)^2$, so $P(A) = 0$ or 1. ∎


3.6. Exercises. 


Exercise 3.6.1. Let $X$ be a real-valued random variable defined on a
probability triple $(\Omega, \mathcal{F}, P)$. Fill in the following blanks:

(a) $\mathcal{F}$ is a collection of subsets of ______.

(b) $P(A)$ is a well-defined element of ______ provided that $A$ is an
element of ______.

(c) $\{X \le 5\}$ is shorthand notation for the particular subset of ______
which is defined by: ______.

(d) If $S$ is a subset of ______, then $\{X \in S\}$ is a subset of ______.

(e) If $S$ is a ______ subset of ______, then $\{X \in S\}$ must be an
element of ______.


Exercise 3.6.2. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let $A =
(1/2, 3/4)$ and $B = (0, 2/3)$. Are $A$ and $B$ independent events?


Exercise 3.6.3. Give an example of events $A$, $B$, and $C$, each of
probability strictly between 0 and 1, such that

(a) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(B \cap C) =
P(B)P(C)$; but it is not the case that $P(A \cap B \cap C) = P(A)P(B)P(C)$.
[Hint: You can let $\Omega$ be a set of four equally likely points.]

(b) $P(A \cap B) = P(A)P(B)$, $P(A \cap C) = P(A)P(C)$, and $P(A \cap B \cap C) =
P(A)P(B)P(C)$; but it is not the case that $P(B \cap C) = P(B)P(C)$. [Hint:
You can let $\Omega$ be a set of eight equally likely points.]


Exercise 3.6.4. Suppose $\{A_n\} \nearrow A$. Let $f : \Omega \to \mathbf{R}$ be any function.
Prove that $\lim_{n\to\infty} \inf_{\omega \in A_n} f(\omega) = \inf_{\omega \in A} f(\omega)$.


Exercise 3.6.5. Let $(\Omega, \mathcal{F}, P)$ be a probability triple such that $\Omega$ is
countable. Prove that it is impossible for there to exist a sequence $A_1, A_2, \ldots \in \mathcal{F}$
which is independent, such that $P(A_i) = \frac{1}{2}$ for each $i$. [Hint: First prove
that for each $\omega \in \Omega$, and each $n \in \mathbf{N}$, we have $P(\{\omega\}) \le 1/2^n$. Then derive
a contradiction.]


Exercise 3.6.6. Let $X$, $Y$, and $Z$ be three independent random variables,
and set $W = X + Y$. Let $B_{k,n} = \{(n-1)2^{-k} \le X < n2^{-k}\}$ and let
$C_{k,m} = \{(m-1)2^{-k} \le Y < m2^{-k}\}$. Let

$$A_k = \bigcup_{\substack{n, m \in \mathbf{Z} \\ (n+m)2^{-k} < x}} (B_{k,n} \cap C_{k,m}).$$

Fix $x, z \in \mathbf{R}$, and let $A = \{X + Y < x\} = \{W < x\}$ and $D = \{Z < z\}$.
(a) Prove that $\{A_k\} \nearrow A$.
(b) Prove that $A_k$ and $D$ are independent.
(c) By continuity of probabilities, prove that $A$ and $D$ are independent.
(d) Use this to prove that $W$ and $Z$ are independent.


Exercise 3.6.7. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega =
\{1, 2, 3\}$, as in Example 2.2.2. Give an example of a sequence $A_1, A_2, \ldots \in \mathcal{F}$
such that

$$P\big(\liminf_n A_n\big) < \liminf_n P(A_n) < \limsup_n P(A_n) < P\big(\limsup_n A_n\big),$$

i.e. such that all three inequalities are strict.


Exercise 3.6.8. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let $0 \le a \le b \le
c \le d \le 1$ be arbitrary real numbers with $d \le b + c - a$. Give an example
of a sequence $A_1, A_2, \ldots$ of intervals in $[0,1]$, such that $\lambda(\liminf_n A_n) = a$,
$\liminf_n \lambda(A_n) = b$, $\limsup_n \lambda(A_n) = c$, and $\lambda(\limsup_n A_n) = d$. For bonus
points, solve the question when $d > b + c - a$, with each $A_n$ a finite union
of intervals.


Exercise 3.6.9. Let $A_1, A_2, \ldots, B_1, B_2, \ldots$ be events.
(a) Prove that

$$\big(\limsup_n A_n\big) \cap \big(\limsup_n B_n\big) \supseteq \limsup_n (A_n \cap B_n).$$

(b) Give an example where the above inclusion is strict, and another
example where it holds with equality.


Exercise 3.6.10. Let $A_1, A_2, \ldots$ be a sequence of events, and let $N \in \mathbf{N}$.
Suppose there are events $B$ and $C$ such that $B \subseteq A_n \subseteq C$ for all $n \ge N$, and
such that $P(B) = P(C)$. Prove that $P(\liminf_n A_n) = P(\limsup_n A_n) =
P(B) = P(C)$.


Exercise 3.6.11. Let $\{X_n\}_{n=1}^{\infty}$ be independent random variables, with
$X_n \sim \text{Uniform}(\{1, 2, \ldots, n\})$ (cf. Example 2.2.2). Compute $P(X_n =
5 \ i.o.)$, the probability that an infinite number of the $X_n$ are equal to 5.


Exercise 3.6.12. Let $X$ be a random variable with $P(X > 0) > 0$. Prove
that there is $\delta > 0$ such that $P(X \ge \delta) > 0$. [Hint: Don’t forget continuity
of probabilities.]


Exercise 3.6.13. Let $X_1, X_2, \ldots$ be defined jointly on some probability
space $(\Omega, \mathcal{F}, P)$, with $E[X_i] = 0$ and $E[(X_i)^2] = 1$ for all $i$. Prove that
$P[X_n \ge n \ i.o.] = 0$.


Exercise 3.6.14. Let $\delta, \epsilon > 0$, and let $X_1, X_2, \ldots$ be a sequence of
non-negative random variables such that $P(X_i \ge \delta) \ge \epsilon$ for all $i$. Prove that
with probability one, $\sum_{i=1}^{\infty} X_i = \infty$.


Exercise 3.6.15. Let $A_1, A_2, \ldots$ be a sequence of events, such that (i)
$A_{i_1}, A_{i_2}, \ldots, A_{i_k}$ are independent whenever $i_{j+1} \ge i_j + 2$ for $1 \le j \le k - 1$,
and (ii) $\sum_n P(A_n) = \infty$. Then the Borel-Cantelli Lemma does not directly
apply. Still, prove that $P(\limsup_n A_n) = 1$.


Exercise 3.6.16. Consider infinite, independent, fair coin tossing as
in Subsection 2.6, and let $H_n$ be the event that the $n^{\text{th}}$ coin is heads.
Determine the following probabilities.

(a) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+9} \ i.o.)$.

(b) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{2n} \ i.o.)$.

(c) $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[2\log_2 n]} \ i.o.)$.

(d) Prove that $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \ i.o.)$ must equal either 0
or 1.

(e) Determine $P(H_{n+1} \cap H_{n+2} \cap \ldots \cap H_{n+[\log_2 n]} \ i.o.)$. [Hint: Find the
right subsequence of indices.]


Exercise 3.6.17. Show that Lemma 3.5.2 is false if we require only that
$P(B \cap B_n) = P(B)\, P(B_n)$ for each $n \in \mathbf{N}$, but do not require that the $\{B_n\}$
be independent of each other. [Hint: Don’t forget Exercise 3.6.3(a).]


Exercise 3.6.18. Let $A_1, A_2, \ldots$ be any independent sequence of events,
and let $S_x = \{\lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{A_i} \le x\}$. Prove that for each $x \in \mathbf{R}$ we
have $P(S_x) = 0$ or 1.


Exercise 3.6.19. Let $A_1, A_2, \ldots$ be independent events. Let $Y$ be a
random variable which is measurable with respect to $\sigma(A_n, A_{n+1}, \ldots)$ for
each $n \in \mathbf{N}$. Prove that there is a real number $a$ such that $P(Y = a) = 1$.
[Hint: Consider $P(Y \le x)$ for $x \in \mathbf{R}$; what values can it take?]


3.7. Section summary. 


In this section, we defined random variables, which are functions on 
the state space. We also defined independence of events and of random 
variables. We derived the continuity property of probability measures. We 
defined limit events and proved the important Borel-Cantelli Lemma. We 
defined tail fields and proved the remarkable Kolmogorov Zero-One Law.


4. Expected values.


There is one more notion that is fundamental to all of probability theory, 
that of expected values. The general definition of expected value will be 
developed in this section. 


4.1. Simple random variables. 


Let $(\Omega, \mathcal{F}, P)$ be a probability triple, and let $X$ be a random variable
defined on this triple. We begin with a definition.


Definition 4.1.1. A random variable $X$ is simple if range($X$) is finite,
where range($X$) $= \{X(\omega);\ \omega \in \Omega\}$.


That is, a random variable is simple if it takes on only a finite number of
different values. If $X$ is a simple random variable, then listing the distinct
elements of its range as $x_1, x_2, \ldots, x_n$, we can then write $X = \sum_{i=1}^{n} x_i \mathbf{1}_{A_i}$,
where $A_i = \{\omega \in \Omega;\ X(\omega) = x_i\} = X^{-1}(\{x_i\})$, and where the $\mathbf{1}_{A_i}$ are
indicator functions. We note that the sets $A_i$ form a finite partition of $\Omega$.

For such a simple random variable $X = \sum_{i=1}^{n} x_i \mathbf{1}_{A_i}$, we define its
expected value or expectation or mean by $E(X) = \sum_{i=1}^{n} x_i P(A_i)$. That is,

$$E\Big(\sum_{i=1}^{n} x_i \mathbf{1}_{A_i}\Big) = \sum_{i=1}^{n} x_i P(A_i), \qquad \{A_i\} \text{ a finite partition of } \Omega. \tag{4.1.2}$$

We sometimes write $\mu_X$ for $E(X)$.


Exercise 4.1.3. Prove that (4.1.2) is well-defined, in the sense that if $\{A_i\}$
and $\{B_j\}$ are two different finite partitions of $\Omega$, such that $\sum_{i=1}^{n} x_i \mathbf{1}_{A_i} =
\sum_{j=1}^{m} y_j \mathbf{1}_{B_j}$, then $\sum_{i=1}^{n} x_i P(A_i) = \sum_{j=1}^{m} y_j P(B_j)$. [Hint: collect together
those $A_i$ and $B_j$ corresponding to the same values of $x_i$ and $y_j$.]


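For a quick example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and
define simple random variables $X$ and $Y$ by

$$X(\omega) = \begin{cases} 5, & \omega \ge 1/3 \\ 3, & \omega < 1/3, \end{cases}
\qquad
Y(\omega) = \begin{cases} 2, & \omega \text{ rational} \\ 4, & \omega = 1/\sqrt{2} \\ 6, & \text{other } \omega < 1/4 \\ 8, & \text{otherwise.} \end{cases}$$

Then it is easily seen that $E(X) = 13/3$, and $E(Y) = 15/2$.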
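
The two expected values can be reproduced from formula (4.1.2) with exact arithmetic. This is only a sketch: the helper `expectation` is a hypothetical name, and the probabilities passed to it are the Lebesgue measures of the pieces of each partition (the rationals and the single point $1/\sqrt{2}$ both have measure 0).

```python
from fractions import Fraction

def expectation(terms):
    """E(X) = sum_i x_i P(A_i) for a simple random variable, as in (4.1.2).

    `terms` is a list of (value, probability) pairs over a finite partition.
    """
    assert sum(p for _, p in terms) == 1, "the A_i must partition Omega"
    return sum(Fraction(x) * p for x, p in terms)

# X = 5 on a set of measure 2/3 and 3 on a set of measure 1/3.
E_X = expectation([(5, Fraction(2, 3)), (3, Fraction(1, 3))])

# Y = 2 on the rationals and 4 at the single point 1/sqrt(2) (both of measure 0),
# 6 on the remaining points below 1/4, and 8 elsewhere.
E_Y = expectation([(2, Fraction(0)), (4, Fraction(0)),
                   (6, Fraction(1, 4)), (8, Fraction(3, 4))])

print(E_X, E_Y)  # 13/3 15/2
```

The values taken on sets of measure 0 contribute nothing, which is why $Y$ behaves like a two-valued random variable here.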

From equation (4.1.2), we see immediately that $E(\mathbf{1}_A) = P(A)$, and
that $E(c) = c$. We now claim that $E(\cdot)$ is linear. Indeed, if $X = \sum_i x_i \mathbf{1}_{A_i}$
and $Y = \sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$, and if
$a, b \in \mathbf{R}$, then $\{A_i \cap B_j\}$ is again a finite partition of $\Omega$, and we have

$$E(aX + bY) = E\Big(\sum_{i,j} (a x_i + b y_j) \mathbf{1}_{A_i \cap B_j}\Big)$$
$$= \sum_{i,j} (a x_i + b y_j) P(A_i \cap B_j)$$
$$= a \sum_i x_i P(A_i) + b \sum_j y_j P(B_j)$$
$$= a E(X) + b E(Y),$$

as claimed. It follows that $E\big(\sum_{i=1}^{n} x_i \mathbf{1}_{A_i}\big) = \sum_{i=1}^{n} x_i P(A_i)$ for any finite
collection of subsets $A_i \subseteq \Omega$, even if they do not form a partition.

It also follows that $E(\cdot)$ is order-preserving, i.e. if $X \le Y$ (meaning that
$X(\omega) \le Y(\omega)$ for all $\omega \in \Omega$), then $E(X) \le E(Y)$. Indeed, in that case
$Y - X \ge 0$, so from (4.1.2) we have $E(Y - X) \ge 0$; by linearity this implies
that $E(Y) - E(X) \ge 0$.

In particular, since $-|X| \le X \le |X|$, we have $|E(X)| \le E(|X|)$, which
is sometimes referred to as the (generalised) triangle inequality. If $X$ takes
on the two values $a$ and $b$, each with probability $\frac{1}{2}$, then this inequality
reduces to the usual $|a + b| \le |a| + |b|$.

Finally, if $X$ and $Y$ are independent simple random variables, then
$E(XY) = E(X)E(Y)$. Indeed, again writing $X = \sum_i x_i \mathbf{1}_{A_i}$ and $Y =
\sum_j y_j \mathbf{1}_{B_j}$, where $\{A_i\}$ and $\{B_j\}$ are finite partitions of $\Omega$ and where $\{x_i\}$
are distinct and $\{y_j\}$ are distinct, we see that $X$ and $Y$ are independent
if and only if $P(A_i \cap B_j) = P(A_i)P(B_j)$ for all $i$ and $j$. In that case,
$E(XY) = \sum_{i,j} x_i y_j P(A_i \cap B_j) = \sum_{i,j} x_i y_j P(A_i) P(B_j) = E(X)\, E(Y)$, as
claimed. Note that this may be false if $X$ and $Y$ are not independent;
for example, if $X$ takes on the values $\pm 1$, each with probability $\frac{1}{2}$, and if
$Y = X$, then $E(X) = E(Y) = 0$ but $E(XY) = 1$. Also, we may have
$E(XY) = E(X)E(Y)$ even if $X$ and $Y$ are not independent; for example,
this occurs if $X$ takes on the three values 0, 1, and 2 each with probability
$\frac{1}{3}$, and if $Y$ is defined by $Y(\omega) = 1$ whenever $X(\omega) = 0$ or 2, and $Y(\omega) = 5$
whenever $X(\omega) = 1$.
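
The last example can be verified directly. The sketch below encodes the joint distribution of $(X, Y)$ described above as a dictionary (the variable names are hypothetical) and checks both that the product formula holds and that independence fails.

```python
from fractions import Fraction

# X takes the values 0, 1, 2 with probability 1/3 each; Y = 5 if X == 1, else 1.
third = Fraction(1, 3)
joint = {(x, 5 if x == 1 else 1): third for x in (0, 1, 2)}  # {(x, y): P(X=x, Y=y)}

E_X = sum(x * p for (x, y), p in joint.items())
E_Y = sum(y * p for (x, y), p in joint.items())
E_XY = sum(x * y * p for (x, y), p in joint.items())
print(E_X, E_Y, E_XY)  # 1 7/3 7/3

# Independence would require P(X=0, Y=5) = P(X=0) P(Y=5), but the left side is 0.
p_x0 = sum(p for (x, y), p in joint.items() if x == 0)
p_y5 = sum(p for (x, y), p in joint.items() if y == 5)
print(joint.get((0, 5), Fraction(0)), p_x0 * p_y5)  # 0 1/9
```

So $E(XY) = E(X)E(Y) = 7/3$ even though $Y$ is a deterministic function of $X$.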

If $X = \sum_i x_i \mathbf{1}_{A_i}$, with $\{A_i\}$ a finite partition of $\Omega$, and if $f : \mathbf{R} \to \mathbf{R}$ is
any function, then $f(X) = \sum_i f(x_i) \mathbf{1}_{A_i}$ is also a simple random variable,
with $E(f(X)) = \sum_i f(x_i) P(A_i)$.

In particular, if $f(x) = (x - \mu_X)^2$, we get the variance of $X$, defined
by $\text{Var}(X) = E\big((X - \mu_X)^2\big)$. Clearly $\text{Var}(X) \ge 0$. Expanding the square
and using linearity, we see that $\text{Var}(X) = E(X^2) - \mu_X^2 = E(X^2) - E(X)^2$.
In particular, we always have

$$\text{Var}(X) \le E(X^2). \tag{4.1.4}$$

It also follows immediately that

$$\text{Var}(\alpha X + \beta) = \alpha^2\, \text{Var}(X). \tag{4.1.5}$$



We also see that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\, \text{Cov}(X, Y)$, where
$\text{Cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big) = E(XY) - E(X)E(Y)$ is the
covariance; in particular, if $X$ and $Y$ are independent then $\text{Cov}(X, Y) = 0$ so
that $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$. More generally, $\text{Var}(\sum_i X_i) =
\sum_i \text{Var}(X_i) + 2 \sum_{i<j} \text{Cov}(X_i, X_j)$, so we see that

$$\text{Var}(X_1 + \ldots + X_n) = \text{Var}(X_1) + \ldots + \text{Var}(X_n), \qquad \{X_n\} \text{ independent.} \tag{4.1.6}$$

Finally, if $\text{Var}(X) > 0$ and $\text{Var}(Y) > 0$, then the correlation between
$X$ and $Y$ is defined by $\text{Corr}(X, Y) = \text{Cov}(X, Y)/\sqrt{\text{Var}(X)\, \text{Var}(Y)}$; see
Exercises 4.5.11 and 5.5.6.

This concludes our discussion of the basic properties of expectation for 
simple random variables. (Indeed, it is possible to read Section 5 immedi- 
ately at this point, provided that one restricts attention to simple random 
variables only.) We now note a fact that will help us to define $E(X)$ for
random variables $X$ which are not simple. It follows immediately from the
order-preserving property of $E(\cdot)$.


Proposition 4.1.7. If $X$ is a simple random variable, then

$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$


4.2. General non-negative random variables. 


If $X$ is not simple, then it is not clear how to define its expected value
$E(X)$. However, Proposition 4.1.7 provides a suggestion of how to proceed.
Indeed, for a general non-negative random variable $X$, we define the
expected value $E(X)$ by

$$E(X) = \sup\{E(Y);\ Y \text{ simple},\ Y \le X\}.$$

By Proposition 4.1.7, if $X$ happens to be a simple random variable then
this definition agrees with the previous one, so there is no confusion in
re-using the same symbol $E(\cdot)$. Indeed, this one single definition (with a minor
modification for negative values in Subsection 4.3 below) will apply to all
random variables, be they discrete, absolutely continuous, or neither (cf.
Section 6).

We note that it is indeed possible that $E(X)$ will be infinite. For example,
suppose $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and define $X$ by
$X(\omega) = 2^n$ for $2^{-n} < \omega \le 2^{-(n-1)}$. (See Figure 4.2.1.) Then $E(X) \ge
\sum_{k=1}^{N} 2^k 2^{-k} = N$ for any $N \in \mathbf{N}$. Hence, $E(X) = \infty$.
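
The lower bound $E(X) \ge N$ comes from a simple random variable lying below $X$. As a rough sketch (the function name is hypothetical), one can truncate $X$ to the first $N$ dyadic intervals and compute the resulting expectation exactly:

```python
from fractions import Fraction

def truncated_expectation(N):
    """E(Y_N), where Y_N = X on (2^-N, 1] and 0 elsewhere, so Y_N <= X is simple.

    X equals 2^k on the interval (2^-k, 2^-(k-1)], which has length 2^-k.
    """
    return sum(Fraction(2) ** k * Fraction(1, 2 ** k) for k in range(1, N + 1))

print([int(truncated_expectation(N)) for N in (1, 5, 50)])  # [1, 5, 50]
```

Each dyadic interval contributes exactly 1 to the sum, so the supremum over simple random variables below $X$ is unbounded.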

Figure 4.2.1. A random variable $X$ having $E(X) = \infty$.


Recall that for $k \in \mathbf{N}$, the $k^{\text{th}}$ moment of a non-negative random variable
$X$ is defined to be $E(X^k)$, finite if $E(|X|^k) < \infty$. Since $|x|^{k-1} \le
\max(|x|^k, 1) \le |x|^k + 1$ for any $x \in \mathbf{R}$, we see that if $E(|X|^k)$ is finite, then
so is $E(|X|^{k-1})$.

It is immediately apparent that our general definition of $E(\cdot)$ is still
order-preserving. However, proving linearity is less clear. To assist, we
have the following result. We say that $\{X_n\} \nearrow X$ if $X_1 \le X_2 \le \ldots$, and
also $\lim_{n\to\infty} X_n(\omega) = X(\omega)$ for each $\omega \in \Omega$. (That is, the sequence $\{X_n\}$
converges monotonically to $X$.)


Theorem 4.2.2. (The monotone convergence theorem.) Suppose
$X_1, X_2, \ldots$ are random variables with $E(X_1) > -\infty$, and $\{X_n\} \nearrow X$. Then
$X$ is a random variable, and $\lim_{n\to\infty} E(X_n) = E(X)$.


Proof. We know from (3.1.6) that $X$ is a random variable (alternatively,
simply note that in this case, $\{X \le x\} = \bigcap_n \{X_n \le x\} \in \mathcal{F}$ for all $x \in \mathbf{R}$).
Furthermore, by monotonicity we have $E(X_1) \le E(X_2) \le \ldots \le E(X)$, so
that $\lim_n E(X_n)$ exists (though it may be infinite if $E(X) = +\infty$), and is
$\le E(X)$.

To finish, it suffices to show that $\lim_n E(X_n) \ge E(X)$. If $E(X_1) = +\infty$
this is trivial, so assume $E(X_1)$ is finite. Then, by replacing $X_n$ by $X_n - X_1$
and $X$ by $X - X_1$, it suffices to assume the $X_n$ and $X$ are non-negative.

By the definition of $E(X)$ for non-negative $X$, it suffices to show that
$\lim_n E(X_n) \ge E(Y)$ for any simple random variable $Y \le X$. Writing $Y =
\sum_i v_i \mathbf{1}_{A_i}$, we see that it suffices to prove that $\lim_n E(X_n) \ge \sum_i v_i P(A_i)$,
where $\{A_i\}$ is any finite partition of $\Omega$ with $v_i \le X(\omega)$ for all $\omega \in A_i$.

To that end, choose $\epsilon > 0$, and set $A_{in} = \{\omega \in A_i;\ X_n(\omega) \ge v_i - \epsilon\}$.
Then $\{A_{in}\} \nearrow A_i$ as $n \to \infty$. Furthermore, $E(X_n) \ge \sum_i (v_i - \epsilon) P(A_{in})$. As
$n \to \infty$, by continuity of probabilities this converges to $\sum_i (v_i - \epsilon) P(A_i) =
\sum_i v_i P(A_i) - \epsilon$. Hence, $\lim E(X_n) \ge \sum_i v_i P(A_i) - \epsilon$. Since this is true for
any $\epsilon > 0$, we must (cf. Proposition A.3.1) have $\lim E(X_n) \ge \sum_i v_i P(A_i)$,
as required. ∎


Remark 4.2.3. Since expected values are unchanged if we modify the
random variable values on sets of probability 0, we still have $\lim_{n\to\infty} E(X_n) =
E(X)$ provided $\{X_n\} \nearrow X$ almost surely (a.s.), i.e. on a subset of $\Omega$ having
probability 1. (Compare Remark 3.1.9.)


We note that the monotonicity assumption of Theorem 4.2.2 is indeed
necessary. For example, if $(\Omega, \mathcal{F}, P)$ is Lebesgue measure on $[0,1]$, and if
$X_n = n \mathbf{1}_{(0, \frac{1}{n})}$, then $X_n \to 0$ (since for each $\omega \in (0,1]$ we have $X_n(\omega) = 0$
for all $n \ge 1/\omega$, and $X_n(0) = 0$ for all $n$), but $E(X_n) = 1$ for all $n$.

To make use of Theorem 4.2.2, set $\Psi_n(x) = \min(n, 2^{-n} \lfloor 2^n x \rfloor)$ for $x \ge
0$, where $\lfloor r \rfloor$ is the floor of $r$, or greatest integer not exceeding $r$. (See
Figure 4.2.4.) Then $\Psi_n(x)$ is a slightly rounded-down version of $x$, truncated
at $n$. Indeed, for fixed $x \ge 0$ we have that $\Psi_n(x) \ge 0$, and $\{\Psi_n(x)\} \nearrow x$ as
$n \to \infty$. Furthermore, the range of $\Psi_n$ is finite (of size $n2^n + 1$). Hence,
this shows:


Proposition 4.2.5. Let $X$ be a general non-negative random variable.
Set $X_n = \Psi_n(X)$ with $\Psi_n$ as above. Then $X_n \ge 0$ and $\{X_n\} \nearrow X$, and
each $X_n$ is a simple random variable. In particular, there exists a sequence
of simple random variables increasing to $X$.
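
The rounding function is easy to experiment with. The following is a minimal sketch of $\Psi_n(x) = \min(n, 2^{-n}\lfloor 2^n x \rfloor)$ using exact rationals; the name `psi` and the test point 22/7 are arbitrary choices made for illustration.

```python
from fractions import Fraction
from math import floor

def psi(n, x):
    """Psi_n(x) = min(n, 2^-n * floor(2^n * x)) for x >= 0."""
    x = Fraction(x)
    return min(Fraction(n), Fraction(floor(2 ** n * x), 2 ** n))

x = Fraction(22, 7)  # an arbitrary non-negative test point
approximations = [psi(n, x) for n in range(1, 8)]
print([str(a) for a in approximations])

# The range of Psi_n is {0, 2^-n, 2 * 2^-n, ..., n}, which has n * 2^n + 1 points.
n = 3
range_size = len({psi(n, Fraction(j, 2 ** n)) for j in range(0, 2 * n * 2 ** n)})
print(range_size)  # 25
```

The printed approximations never decrease and never exceed the test point; once $n$ exceeds $x$, the error is below $2^{-n}$.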


Using Theorem 4.2.2 and Proposition 4.2.5, it is straightforward to prove
the linearity of expected values for general non-negative random variables.
Indeed, with $X_n = \Psi_n(X)$ and $Y_n = \Psi_n(Y)$, we have (using linearity of
expectation for simple random variables) that for $X, Y \ge 0$ and $a, b \ge 0$,

$$E(aX + bY) = \lim_n E(aX_n + bY_n) = \lim_n \big(aE(X_n) + bE(Y_n)\big) = aE(X) + bE(Y). \tag{4.2.6}$$


Figure 4.2.4. A graph of $y = \Psi_2(x)$.


Similarly, if $X$ and $Y$ are independent, then by Proposition 3.2.3, $X_n$
and $Y_n$ are also independent. Hence, if $E(X)$ and $E(Y)$ are finite,

$$E(XY) = \lim_n E(X_n Y_n) = \lim_n E(X_n)\, E(Y_n) = E(X)\, E(Y), \tag{4.2.7}$$

as was the case for simple random variables. It then follows that $\text{Var}(X +
Y) = \text{Var}(X) + \text{Var}(Y)$ exactly as in the previous case.

We also have countable linearity. Indeed, if $X_1, X_2, \ldots \ge 0$, then by the
monotone convergence theorem,

$$E(X_1 + X_2 + \ldots) = E\big(\lim_n (X_1 + X_2 + \ldots + X_n)\big) = \lim_n E(X_1 + X_2 + \ldots + X_n)$$
$$= \lim_n \big[E(X_1) + E(X_2) + \ldots + E(X_n)\big] = E(X_1) + E(X_2) + \ldots \tag{4.2.8}$$

If the $X_i$ are not non-negative, then (4.2.8) may fail, though it still holds
under certain conditions; see Exercise 4.5.14 (and Corollary 9.4.4).

We also have the following.


Proposition 4.2.9. If $X$ is a non-negative random variable, then
$\sum_{k=1}^{\infty} P(X \ge k) = E\lfloor X \rfloor$, where $\lfloor X \rfloor$ is the greatest integer not exceeding
$X$. (In particular, if $X$ is non-negative-integer valued, then
$\sum_{k=1}^{\infty} P(X \ge k) = E(X)$.)


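Proof. We compute that

$$\sum_{k=1}^{\infty} P(X \ge k) = \sum_{k=1}^{\infty} \big[ P(k \le X < k+1) + P(k+1 \le X < k+2) + \ldots \big]$$

$$= \sum_{\ell=1}^{\infty} \ell\, P(\ell \le X < \ell+1) = \sum_{\ell=1}^{\infty} \ell\, P(\lfloor X \rfloor = \ell) = E\lfloor X \rfloor. \qquad \blacksquare$$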
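
As a sanity check on Proposition 4.2.9, the two sides can be compared exactly for a small discrete distribution. This is only a sketch: the distribution `dist` below is hypothetical and was chosen so that $X$ takes some non-integer values.

```python
from fractions import Fraction
from math import floor

# Hypothetical distribution of X: value -> probability (probabilities sum to 1).
dist = {
    Fraction(0): Fraction(1, 10),
    Fraction(1, 2): Fraction(2, 10),
    Fraction(17, 10): Fraction(3, 10),
    Fraction(3): Fraction(1, 10),
    Fraction(16, 5): Fraction(3, 10),
}

def tail_sum(dist):
    """sum_{k >= 1} P(X >= k); the terms vanish once k exceeds the largest value."""
    k_max = floor(max(dist))
    return sum(
        sum(p for x, p in dist.items() if x >= k)
        for k in range(1, k_max + 1)
    )

def expected_floor(dist):
    """E(floor(X)) computed directly from the distribution."""
    return sum(floor(x) * p for x, p in dist.items())

print(tail_sum(dist), expected_floor(dist))
```

Both quantities are computed with exact fractions, so equality can be tested directly rather than up to a tolerance.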


4.3. Arbitrary random variables. 


Finally, we consider random variables which may be neither simple nor
non-negative. For such a random variable $X$, we may write $X = X^+ - X^-$,
where $X^+(\omega) = \max(X(\omega), 0)$ and $X^-(\omega) = \max(-X(\omega), 0)$. Both $X^+$
and $X^-$ are non-negative, so the theory of the previous subsection applies
to them. We may then set

$$E(X) = E(X^+) - E(X^-). \qquad (4.3.1)$$

We note that $E(X)$ is undefined if both $E(X^+)$ and $E(X^-)$ are infinite.
However, if $E(X^+) = \infty$ and $E(X^-) < \infty$, then we take $E(X) = \infty$.
Similarly, if $E(X^+) < \infty$ and $E(X^-) = \infty$, then we take $E(X) = -\infty$.
Obviously, if $E(X^+) < \infty$ and $E(X^-) < \infty$, then $E(X)$ will be a finite
number.
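
For a random variable with finitely many values, definition (4.3.1) can be checked against the direct weighted average. The sketch below assumes a hypothetical three-point distribution; the helper name `expectation` is illustrative.

```python
from fractions import Fraction

# Hypothetical distribution of X: value -> probability.
dist = {Fraction(-2): Fraction(1, 4), Fraction(0): Fraction(1, 4), Fraction(3): Fraction(1, 2)}

def expectation(dist, g=lambda x: x):
    """E(g(X)) for a finitely supported X."""
    return sum(g(x) * p for x, p in dist.items())

e_plus = expectation(dist, lambda x: max(x, 0))    # E(X^+)
e_minus = expectation(dist, lambda x: max(-x, 0))  # E(X^-)

print(e_plus, e_minus, e_plus - e_minus, expectation(dist))
```

Here both parts are finite, so the difference is an ordinary number; the infinite cases described above cannot arise for a finitely supported distribution.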

We next check that, with this modification, expected value retains the 
basic properties of order-preserving, linear, etc.: 


Exercise 4.3.2. Let $X$ and $Y$ be two general random variables (not
necessarily non-negative) with well-defined means, such that $X \le Y$.

(a) Prove that $X^+ \le Y^+$ and $X^- \ge Y^-$.

(b) Prove that expectation is still order-preserving, i.e. that $E(X) \le E(Y)$
under these assumptions.


Exercise 4.3.3. Let $X$ and $Y$ be two general random variables with
finite means, and let $Z = X + Y$.

(a) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.

(b) Prove that $E(Z) = E(X) + E(Y)$, i.e. that
$E(Z^+) - E(Z^-) = E(X^+) - E(X^-) + E(Y^+) - E(Y^-)$. [Hint: Re-arrange the relations of part (a) so
that you can make use of (4.2.6).]

(c) Prove that expectation is still (finitely) linear, for general random
variables with finite means.


Exercise 4.3.4. Let $X$ and $Y$ be two independent general random
variables with finite means, and let $Z = XY$.

(a) Prove that $X^+$ and $Y^+$ are independent, and similarly for each of $X^+$
and $Y^-$, and $X^-$ and $Y^+$, and $X^-$ and $Y^-$.

(b) Express $Z^+$ and $Z^-$ in terms of $X^+$, $X^-$, $Y^+$, and $Y^-$.

(c) Prove that $E(XY) = E(X)\, E(Y)$.


4.4. The integration connection. 


Given a probability triple $(\Omega, \mathcal{F}, P)$, we sometimes write $E(X)$ as $\int_\Omega X\,dP$
or $\int_\Omega X(\omega)\,P(d\omega)$ (the “$\Omega$” is sometimes omitted). We call this the integral
of the (measurable) function $X$ with respect to the (probability) measure
$P$.

Why do we make this identification? Well, certainly expected value 
satisfies some similar properties to that of the integral: it is linear, order- 
preserving, etc. But a more convincing reason is given by 


Theorem 4.4.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. Let
$X : [0,1] \to \mathbf{R}$ be a bounded function which is Riemann integrable (i.e.
integrable in the usual calculus sense). Then $X$ is a random variable with
respect to $(\Omega, \mathcal{F}, P)$, and $E(X) = \int_0^1 X(t)\,dt$. In words, the expected value
of the random variable $X$ is equal to the calculus-style integral of the
function $X$.


Proof. Recall the definitions of lower and upper integrals, viz.

$$L\!\int_0^1 X = \sup\Big\{ \sum_{i=1}^{n} (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t) \,;\; 0 = t_0 < t_1 < \ldots < t_n = 1 \Big\};$$

$$U\!\int_0^1 X = \inf\Big\{ \sum_{i=1}^{n} (t_i - t_{i-1}) \sup_{t_{i-1} \le t \le t_i} X(t) \,;\; 0 = t_0 < t_1 < \ldots < t_n = 1 \Big\}.$$

Recall that we always have $L\int_0^1 X \le U\int_0^1 X$, and that $X$ is Riemann
integrable if $L\int_0^1 X = U\int_0^1 X$, in which case $\int_0^1 X(t)\,dt$ is defined to be this
common value.

But $\sum_i (t_i - t_{i-1}) \inf_{t_{i-1} \le t \le t_i} X(t) = E(Y)$, where the simple random
variable $Y$ is defined by $Y(\omega) = \inf_{t_{i-1} \le t \le t_i} X(t)$ whenever $t_{i-1} \le \omega < t_i$.
Hence, since $X$ is Riemann integrable, for each $n \in \mathbf{N}$ we can find a simple
random variable $Y_n \le X$ with $E(Y_n) \ge L\int_0^1 X(t)\,dt - \frac{1}{n}$. Similarly we can
find $Z_n \ge X$ with $E(Z_n) \le U\int_0^1 X(t)\,dt + \frac{1}{n}$. Let $A_n = \max(Y_1, \ldots, Y_n)$,
and $B_n = \min(Z_1, \ldots, Z_n)$.

We claim that $\{A_n\} \nearrow X$ a.s. Indeed, $\{A_n\}$ is increasing by construction.
Furthermore, $A_n \le X \le B_n$, and $\lim_{n\to\infty} E(A_n) = \lim_{n\to\infty} E(B_n) = \int_0^1 X(t)\,dt$.
Hence, if $S_k = \{\omega \in \Omega : \lim_{n\to\infty} (B_n(\omega) - A_n(\omega)) \ge 1/k\}$, then
$0 = \lim_{n\to\infty} E(B_n - A_n) \ge P(S_k)/k$, so $P(S_k) = 0$ for all $k \in \mathbf{N}$. Then
$P[\lim_{n\to\infty} A_n < X] \le P[\lim_{n\to\infty} A_n < \lim_{n\to\infty} B_n] \le P[\bigcup_k S_k] = 0$ by
countable subadditivity, proving the claim.

Hence, by Theorem 4.2.2 and Remarks 4.2.3 and 3.1.9, $X$ must be a
random variable, with $E(X) = \lim_{n\to\infty} E(A_n) = \int_0^1 X(t)\,dt$. ∎
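
The step functions used in this proof are easy to compute in a concrete case. The following sketch takes the illustrative integrand $X(t) = t^2$ with $n$ equal cells (both are assumptions made for the example, not choices from the text) and evaluates the expectations of the simple random variables lying below and above $X$.

```python
from fractions import Fraction

def lower_upper(n):
    """E(Y) and E(Z) for the step functions below and above X(t) = t^2 on n equal cells."""
    width = Fraction(1, n)
    # X is increasing on [0, 1], so the inf and sup on each cell sit at its endpoints.
    lower = sum(width * Fraction(i - 1, n) ** 2 for i in range(1, n + 1))
    upper = sum(width * Fraction(i, n) ** 2 for i in range(1, n + 1))
    return lower, upper

for n in (10, 100, 1000):
    lo, hi = lower_upper(n)
    print(n, float(lo), float(hi), hi - lo)
```

The gap between the two expectations is exactly $1/n$ here, so both squeeze down to the common value $\int_0^1 t^2\,dt = 1/3$.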


On the other hand, there are many functions $X$ which are not Riemann
integrable, but which nevertheless are random variables with respect to
Lebesgue measure, and thus have well-defined expected values. For example,
if $X$ is defined by $X(t) = 1$ for $t$ irrational, but $X(t) = 0$ for $t$ rational,
i.e. $X = \mathbf{1}_{\mathbf{Q}^c}$ where $\mathbf{Q}^c$ is the set of irrational numbers, then $X$ is not
Riemann-integrable but we still have $E(X) = \int X\,dP = 1$ being perfectly
well-defined.

We sometimes call the expected value of $X$, with respect to the Lebesgue
measure probability triple, its Lebesgue integral, and if $E|X| < \infty$ we say
that $X$ is Lebesgue integrable. By Theorem 4.4.1, the Lebesgue integral is a
generalisation of the Riemann integral. Hence, in addition to learning many 
things about probability theory, we are also learning more about integration 
at the same time! 

We also note that we do not need to restrict attention to integrals over
the unit interval $[0,1]$. Indeed, if $X : \mathbf{R} \to [0, \infty)$ is Borel-measurable, then
we can define $\int_{-\infty}^{\infty} X\,d\lambda$, where $\lambda$ stands for Lebesgue measure on the entire
real line $\mathbf{R}$, by

$$\int_{-\infty}^{\infty} X(t)\,\lambda(dt) = \sum_{n \in \mathbf{Z}} \int_0^1 X(n+t)\,P(dt), \qquad (4.4.2)$$

or equivalently

$$\int_{-\infty}^{\infty} X(t)\,\lambda(dt) = \sum_{n \in \mathbf{Z}} E(Y_n),$$

where $Y_n(t) = X(n+t)$ and the expectation is with respect to Lebesgue
measure on $[0,1]$. That is, we can integrate a non-negative function over the
entire real line by adding up its integral over each interval $[n, n+1)$. (For
general $X$, we can then write $X = X^+ - X^-$, as usual.) In other words,
we can represent $\lambda$, Lebesgue measure on $\mathbf{R}$, as a sum of countably many
copies of Lebesgue measure on unit intervals. We shall see in Section 6 that
we may use Lebesgue measure on $\mathbf{R}$ to help us define the distributions of
general absolutely-continuous random variables.


Remark 4.4.3. Measures like $\lambda$ above, which are not finite but which
can be written as the countable sum of finite measures, are called $\sigma$-finite
measures. They are not probability measures, and cannot be used to define
probability spaces; however by countable additivity they still satisfy many
properties that probability measures do. Of course, unlike in the probability
measure case, $\int_{-\infty}^{\infty} X\,d\lambda$ may be infinite or undefined even if $X$ is bounded.


4.5. Exercises. 


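Exercise 4.5.1. Let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and set

$$X(\omega) = \begin{cases} 1, & 0 \le \omega < 1/4 \\ 2\omega^2, & 1/4 \le \omega < 3/4 \\ \omega, & 3/4 \le \omega \le 1. \end{cases}$$

Compute $P(X \in A)$ where
(a) $A = [0,1]$.
(b) $A = [\frac{1}{2}, 1]$.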


Exercise 4.5.2. Let $X$ be a random variable with finite mean, and let
$a \in \mathbf{R}$ be any real number. Prove that $E(\max(X, a)) \ge \max(E(X), a)$.
[Hint: Consider separately the cases $E(X) \ge a$ and $E(X) < a$.] (See also
Exercise 5.5.7.)


Exercise 4.5.3. Give an example of random variables $X$ and $Y$ defined
on Lebesgue measure on $[0,1]$, such that $P(X > Y) > \frac{1}{2}$, but $E(X) < E(Y)$.


Exercise 4.5.4. Let $(\Omega, \mathcal{F}, P)$ be the uniform distribution on $\Omega = \{1, 2, 3\}$,
as in Example 2.2.2. Find random variables $X$, $Y$, and $Z$ on
$(\Omega, \mathcal{F}, P)$ such that $P(X > Y)\,P(Y > Z)\,P(Z > X) > 0$, and $E(X) = E(Y) = E(Z)$.


Exercise 4.5.5. Let $X$ be a random variable on $(\Omega, \mathcal{F}, P)$, and suppose
that $\Omega$ is a finite set. Prove that $X$ is a simple random variable.


Exercise 4.5.6. Let $X$ be a random variable defined on Lebesgue measure
on $[0,1]$, and suppose that $X$ is a one-to-one function, i.e. that if $\omega_1 \ne \omega_2$
then $X(\omega_1) \ne X(\omega_2)$. Prove that $X$ is not a simple random variable.

Exercise 4.5.7. (Principle of inclusion-exclusion, general case) Let
$A_1, A_2, \ldots, A_n \in \mathcal{F}$. Generalise the principle of inclusion-exclusion to:

$$P(A_1 \cup \ldots \cup A_n) = \sum_{i=1}^{n} P(A_i) - \sum_{1 \le i < j \le n} P(A_i \cap A_j)$$

$$+ \sum_{1 \le i < j < k \le n} P(A_i \cap A_j \cap A_k) - \ldots \pm P(A_1 \cap \ldots \cap A_n).$$

[Hint: Expand $1 - \prod_{i=1}^{n} (1 - \mathbf{1}_{A_i})$, and take expectations of both sides.]
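
Before attempting the proof, the identity can be checked by brute force on a small example. This sketch assumes a uniform measure on a hypothetical twelve-point sample space with three arbitrarily chosen events; it is a numerical check only, not a substitute for the argument requested in the exercise.

```python
from fractions import Fraction
from itertools import combinations

def prob(event, omega):
    """Uniform probability of a subset of the finite sample space omega."""
    return Fraction(len(event), len(omega))

def inclusion_exclusion(events, omega):
    """Alternating sum over all non-empty sub-collections of the events."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1
        for chosen in combinations(events, r):
            total += sign * prob(set.intersection(*chosen), omega)
    return total

omega = set(range(12))
events = [{0, 1, 2, 3, 4}, {3, 4, 5, 6}, {0, 4, 6, 8, 10}]

print(inclusion_exclusion(events, omega), prob(set.union(*events), omega))
```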


Exercise 4.5.8. Let $f(x) = ax^2 + bx + c$ be a second-degree polynomial
function (where $a, b, c \in \mathbf{R}$ are constants).

(a) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the
equation $E(f(\alpha X)) = \alpha^2 E(f(X))$ holds for all $\alpha \in \mathbf{R}$ and all random
variables $X$.

(b) Find necessary and sufficient conditions on $a$, $b$, and $c$ such that the
equation $E(f(X - \beta)) = E(f(X))$ holds for all $\beta \in \mathbf{R}$ and all random
variables $X$.

(c) Do parts (a) and (b) account for the properties of the variance function? 
Why or why not? 


Exercise 4.5.9. In proving property (4.1.6) of variance, why did we 
not simply proceed by induction on n? That is, suppose we know that 
Var(X + Y) = Var(X) + Var(Y) whenever X and Y are independent. 
Does it follow easily that Var(X + Y + Z) = Var(X) + Var(Y) + Var(Z) 
whenever X, Y, and Z are independent? Why or why not? How does 
Exercise 3.6.6 fit in? 


Exercise 4.5.10. Let $X_1, X_2, \ldots$ be i.i.d. with mean $\mu$ and variance $\sigma^2$,
and let $N$ be an integer-valued random variable with mean $m$ and variance $v$,
with $N$ independent of all the $X_i$. Let $S = X_1 + \ldots + X_N = \sum_{i=1}^{\infty} X_i \mathbf{1}_{N \ge i}$.
Compute $\mathrm{Var}(S)$ in terms of $\mu$, $\sigma^2$, $m$, and $v$.


Exercise 4.5.11. Let X and Z be independent, each with the standard 
normal distribution, let a,b € R (not both 0), and let Y = aX + bZ. 

(a) Compute Corr(X,Y). 

(b) Show that $|\mathrm{Corr}(X, Y)| \le 1$ in this case. (Compare Exercise 5.5.6.)

(c) Give necessary and sufficient conditions on the values of $a$ and $b$ such
that $\mathrm{Corr}(X, Y) = 1$.

(d) Give necessary and sufficient conditions on the values of a and b such 
that Corr(X,Y) = —1. 

Exercise 4.5.12. Let $X$ and $Y$ be independent general non-negative
random variables, and let $X_n = \Psi_n(X)$, where $\Psi_n(x) = \min(n, 2^{-n} \lfloor 2^n x \rfloor)$
as in Proposition 4.2.5.

(a) Give an example of a sequence of functions $\Phi_n : [0, \infty) \to [0, \infty)$, other
than $\Phi_n(x) = \Psi_n(x)$, such that for all $x$, $0 \le \Phi_n(x) \le x$ and $\{\Phi_n(x)\} \nearrow x$
as $n \to \infty$.

(b) Suppose $Y_n = \Phi_n(Y)$ with $\Phi_n$ as in part (a). Must $X_n$ and $Y_n$ be
independent?

(c) Suppose $\{Y_n\}$ is an arbitrary collection of non-negative simple random
variables such that $\{Y_n\} \nearrow Y$. Must $X_n$ and $Y_n$ be independent?

(d) Under the assumption of part (c), determine (with proof) which
quantities in equation (4.2.7) are necessarily equal.


Exercise 4.5.13. Give examples of a random variable $X$ defined on
Lebesgue measure on $(0, 1]$, such that

(a) $E(X^+) = \infty$ and $0 < E(X^-) < \infty$.

(b) $E(X^-) = \infty$ and $0 < E(X^+) < \infty$.

(c) $E(X^+) = E(X^-) = \infty$.

(d) $E(X) < \infty$ but $E(X^2) = \infty$.


Exercise 4.5.14. Let $Z_1, Z_2, \ldots$ be general random variables with $E|Z_i| < \infty$,
and let $Z = Z_1 + Z_2 + \ldots$.

(a) Suppose $\sum_i E(Z_i^+) < \infty$ and $\sum_i E(Z_i^-) < \infty$. Prove that
$E(Z) = \sum_i E(Z_i)$.

(b) Show that we still have $E(Z) = \sum_i E(Z_i)$ if we have at least one of
$\sum_i E(Z_i^+) < \infty$ or $\sum_i E(Z_i^-) < \infty$.

(c) Let $\{Z_i\}$ be independent, with $P(Z_i = +1) = P(Z_i = -1) = \frac{1}{2}$ for each
$i$. Does $E(Z) = \sum_i E(Z_i)$ in this case? How does that relate to (4.2.8)?


Exercise 4.5.15. Let $(\Omega_1, \mathcal{F}_1, P_1)$ and $(\Omega_2, \mathcal{F}_2, P_2)$ be two probability
triples. Let $A_1, A_2, \ldots \in \mathcal{F}_1$, and $B_1, B_2, \ldots \in \mathcal{F}_2$. Suppose that it happens
that the sets $\{A_n \times B_n\}$ are all disjoint, and furthermore that
$\bigcup_{n=1}^{\infty} (A_n \times B_n) = A \times B$ for some $A \in \mathcal{F}_1$ and $B \in \mathcal{F}_2$.

(a) Prove that for each $\omega \in \Omega_1$, we have

$$\mathbf{1}_A(\omega)\, P_2(B) = \sum_{n=1}^{\infty} \mathbf{1}_{A_n}(\omega)\, P_2(B_n).$$

[Hint: This is essentially countable additivity of $P_2$, but you do need to be
careful about disjointness.]

(b) By taking expectations of both sides with respect to $P_1$ and using
countable additivity of $P_1$, prove that

$$P_1(A)\, P_2(B) = \sum_{n=1}^{\infty} P_1(A_n)\, P_2(B_n).$$

(c) Use this result to prove that the $\mathcal{J}$ and $P$ for product measure,
presented in Subsection 2.6, do indeed satisfy (2.5.5).


4.6. Section summary. 


In this section we defined the expected value E(X) of a random variable 
X, first for simple random variables and then for general random variables. 
We proved basic properties such as linearity and order-preserving. We de- 
fined the variance Var(X). If X and Y are independent, then E(XY) = 
E(X)E(Y) and Var(X + Y) = Var(X) + Var(Y). We also proved the 
monotone convergence theorem. Finally, we connected expected value to 
Riemann (calculus-style) integration. 

5. Inequalities and convergence.


In this section we consider various relationships regarding expected val- 
ues and limits. 


5.1. Various inequalities. 
We begin with two very important inequalities about random variables. 


Proposition 5.1.1. (Markov’s inequality.) If $X$ is a non-negative random
variable, then for all $\alpha > 0$,

$$P(X \ge \alpha) \le E(X)/\alpha.$$

In words, the probability that $X$ exceeds $\alpha$ is bounded above by its mean
divided by $\alpha$.


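Proof. Define a new random variable $Z$ by

$$Z(\omega) = \begin{cases} \alpha, & X(\omega) \ge \alpha \\ 0, & X(\omega) < \alpha. \end{cases}$$

Then clearly $Z \le X$, so that $E(Z) \le E(X)$ by the order-preserving
property. On the other hand, we compute that $E(Z) = \alpha P(X \ge \alpha)$. Hence,
$\alpha P(X \ge \alpha) \le E(X)$. ∎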


Markov’s inequality applies only to non-negative random variables, but 
it immediately implies another inequality which holds more generally: 


Proposition 5.1.2. (Chebychev’s inequality.) Let $Y$ be an arbitrary
random variable, with finite mean $\mu_Y$. Then for all $\alpha > 0$,

$$P(|Y - \mu_Y| \ge \alpha) \le \mathrm{Var}(Y)/\alpha^2.$$

In words, the probability that $Y$ differs from its mean by more than $\alpha$ is
bounded above by its variance divided by $\alpha^2$.


Proof. Set $X = (Y - \mu_Y)^2$. Then $X$ is a non-negative random variable.
Thus, using Proposition 5.1.1, we have

$$P(|Y - \mu_Y| \ge \alpha) = P(X \ge \alpha^2) \le E(X)/\alpha^2 = \mathrm{Var}(Y)/\alpha^2. \qquad \blacksquare$$
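
Both inequalities can be checked exactly on a small example. The sketch below uses a hypothetical three-point distribution and illustrative helper names; it confirms the bounds for a handful of thresholds and shows how loose they can be.

```python
from fractions import Fraction

# Hypothetical non-negative random variable: value -> probability.
dist = {0: Fraction(1, 2), 1: Fraction(1, 4), 4: Fraction(1, 4)}

def expectation(dist, g=lambda x: x):
    return sum(g(x) * p for x, p in dist.items())

def prob(dist, pred):
    return sum(p for x, p in dist.items() if pred(x))

mean = expectation(dist)                            # E(X) = 5/4
var = expectation(dist, lambda x: (x - mean) ** 2)  # Var(X) = 43/16

def markov_holds(alpha):
    """P(X >= alpha) <= E(X) / alpha."""
    return prob(dist, lambda x: x >= alpha) <= mean / alpha

def chebychev_holds(alpha):
    """P(|X - mean| >= alpha) <= Var(X) / alpha^2."""
    return prob(dist, lambda x: abs(x - mean) >= alpha) <= var / alpha**2

thresholds = [Fraction(1, 2), 1, 2, 4, 5]
print([markov_holds(a) for a in thresholds])
print([chebychev_holds(a) for a in thresholds])
```

For small thresholds the right-hand sides exceed 1, so the bounds carry information only once the threshold is large compared with the mean or the standard deviation.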


We shall use the above two inequalities extensively, including to prove 
the laws of large numbers presented below. Two other sometimes-useful 
inequalities are as follows. 

Proposition 5.1.3. (Cauchy-Schwarz inequality.) Let $X$ and $Y$ be
random variables with $E(X^2) < \infty$ and $E(Y^2) < \infty$. Then
$E|XY| \le \sqrt{E(X^2)\, E(Y^2)}$.


Proof. Let $Z = |X| / \sqrt{E(X^2)}$ and $W = |Y| / \sqrt{E(Y^2)}$, so that
$E(Z^2) = E(W^2) = 1$. Then

$$0 \le E\big( (Z - W)^2 \big) = E(Z^2 + W^2 - 2ZW) = 1 + 1 - 2E(ZW),$$

so $E(ZW) \le 1$, i.e. $E|XY| \le \sqrt{E(X^2)\, E(Y^2)}$. ∎


Proposition 5.1.4. (Jensen’s inequality.) Let $X$ be a random variable
with finite mean, and let $\phi : \mathbf{R} \to \mathbf{R}$ be a convex function, i.e. a function
such that $\lambda \phi(x) + (1 - \lambda) \phi(y) \ge \phi(\lambda x + (1 - \lambda) y)$ for $x, y, \lambda \in \mathbf{R}$ and
$0 \le \lambda \le 1$. Then $E(\phi(X)) \ge \phi(E(X))$.


Proof. Since $\phi$ is convex, we can find a linear function $g(x) = ax + b$
which lies entirely below the graph of $\phi$ but which touches it at the point
$x = E(X)$, i.e. such that $g(x) \le \phi(x)$ for all $x \in \mathbf{R}$, and $g(E(X)) = \phi(E(X))$.
Then

$$E(\phi(X)) \ge E(g(X)) = E(aX + b) = aE(X) + b = g(E(X)) = \phi(E(X)). \qquad \blacksquare$$


5.2. Convergence of random variables. 


If $Z, Z_1, Z_2, \ldots$ are random variables defined on some $(\Omega, \mathcal{F}, P)$, what
does it mean to say that $\{Z_n\}$ converges to $Z$ as $n \to \infty$?

One notion we have already seen (cf. Theorem 4.2.2) is pointwise
convergence, i.e. $\lim_{n\to\infty} Z_n(\omega) = Z(\omega)$. A slightly weaker notion which often
arises is convergence almost surely (or, a.s. or with probability 1 or w.p. 1
or almost everywhere), meaning that $P(\lim_{n\to\infty} Z_n = Z) = 1$, i.e. that
$P\{\omega \in \Omega : \lim_{n\to\infty} Z_n(\omega) = Z(\omega)\} = 1$. As an aid to establishing such
convergence, we have the following:


Lemma 5.2.1. Let $Z, Z_1, Z_2, \ldots$ be random variables. Suppose for each
$\epsilon > 0$, we have $P(|Z_n - Z| \ge \epsilon \text{ i.o.}) = 0$. Then $P(Z_n \to Z) = 1$, i.e. $\{Z_n\}$
converges to $Z$ almost surely.


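Proof. It follows from Proposition A.3.3 that

$$P(Z_n \to Z) = P(\forall \epsilon > 0,\ |Z_n - Z| < \epsilon \text{ a.a.}) = 1 - P(\exists \epsilon > 0,\ |Z_n - Z| \ge \epsilon \text{ i.o.}).$$

By countable subadditivity, we have that

$$P(\exists \epsilon > 0,\ \epsilon \text{ rational},\ |Z_n - Z| \ge \epsilon \text{ i.o.}) \le \sum_{\epsilon > 0,\ \epsilon \text{ rational}} P(|Z_n - Z| \ge \epsilon \text{ i.o.}) = 0.$$

But given any $\epsilon > 0$, there exists a rational $\epsilon' > 0$ with $\epsilon' < \epsilon$. For this $\epsilon'$,
we have that $\{|Z_n - Z| \ge \epsilon \text{ i.o.}\} \subseteq \{|Z_n - Z| \ge \epsilon' \text{ i.o.}\}$. It follows that

$$P(\exists \epsilon > 0,\ |Z_n - Z| \ge \epsilon \text{ i.o.}) \le P(\exists \epsilon' > 0,\ \epsilon' \text{ rational},\ |Z_n - Z| \ge \epsilon' \text{ i.o.}) = 0,$$

thus giving the result. ∎

Combining Lemma 5.2.1 with the Borel-Cantelli Lemma, we obtain: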


Corollary 5.2.2. Let $Z, Z_1, Z_2, \ldots$ be random variables. Suppose for
each $\epsilon > 0$, we have $\sum_n P(|Z_n - Z| \ge \epsilon) < \infty$. Then $P(Z_n \to Z) = 1$, i.e.
$\{Z_n\}$ converges to $Z$ almost surely.


Another notion only involves probabilities: we say that $\{Z_n\}$ converges
to $Z$ in probability if for all $\epsilon > 0$, $\lim_{n\to\infty} P(|Z_n - Z| \ge \epsilon) = 0$. We next
consider the relation between convergence in probability, and convergence
almost surely.


Proposition 5.2.3. Let $Z, Z_1, Z_2, \ldots$ be random variables. Suppose
$Z_n \to Z$ almost surely (i.e., $P(Z_n \to Z) = 1$). Then $Z_n \to Z$ in probability
(i.e., for any $\epsilon > 0$, we have $P(|Z_n - Z| \ge \epsilon) \to 0$). That is, if a sequence of
random variables converges almost surely, then it converges in probability
to the same limit.


Proof. Fix $\epsilon > 0$, and let $A_n = \{\omega ;\ \exists m \ge n,\ |Z_m - Z| \ge \epsilon\}$. Then
$\{A_n\}$ is a decreasing sequence of events. Furthermore, if $\omega \in \bigcap_n A_n$,
then $Z_n(\omega) \not\to Z(\omega)$ as $n \to \infty$. Hence, $P(\bigcap_n A_n) \le P(Z_n \not\to Z) = 0$.
By continuity of probabilities, we have $P(A_n) \to P(\bigcap_n A_n) = 0$. Hence,
$P(|Z_n - Z| \ge \epsilon) \le P(A_n) \to 0$, as required. ∎


On the other hand, the converse to Proposition 5.2.3 is false. (This
justifies the use of “strong” and “weak” in describing the two Laws of
Large Numbers below.) For a first example, let $\{Z_n\}$ be independent, with
$P(Z_n = 1) = 1/n = 1 - P(Z_n = 0)$. (Formally, the existence of such $\{Z_n\}$
follows from Theorem 7.1.1.) Then clearly $Z_n$ converges to 0 in probability.
On the other hand, by the Borel-Cantelli Lemma, $P(Z_n = 1 \text{ i.o.}) = 1$, so
$P(Z_n \to 0) = 0$, so $Z_n$ does not converge to 0 almost surely.

For a second example, let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$, and
set $Z_1 = \mathbf{1}_{[0,\frac{1}{2})}$, $Z_2 = \mathbf{1}_{[\frac{1}{2},1]}$, $Z_3 = \mathbf{1}_{[0,\frac{1}{4})}$, $Z_4 = \mathbf{1}_{[\frac{1}{4},\frac{1}{2})}$, $Z_5 = \mathbf{1}_{[\frac{1}{2},\frac{3}{4})}$,
$Z_6 = \mathbf{1}_{[\frac{3}{4},1]}$, $Z_7 = \mathbf{1}_{[0,\frac{1}{8})}$, $Z_8 = \mathbf{1}_{[\frac{1}{8},\frac{1}{4})}$, etc. Then, by inspection, $Z_n$ converges to
0 in probability, but $Z_n$ does not converge to 0 almost surely.
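
The second example can be made concrete with a few lines of code. The sketch below reads the sequence as indicators of successive dyadic intervals, block $j$ consisting of the $2^j$ intervals of length $2^{-j}$ that sweep across $[0,1]$; this indexing, and the names `interval` and `z`, are assumptions made for illustration.

```python
from fractions import Fraction

def interval(n):
    """Endpoints (lo, hi) of the interval whose indicator is Z_n, for n = 1, 2, ..."""
    j = (n + 1).bit_length() - 1   # block j holds the 2^j intervals of length 2^-j
    k = n - (2**j - 1)             # position of Z_n inside block j, counted from 0
    return Fraction(k, 2**j), Fraction(k + 1, 2**j)

def z(n, omega):
    lo, hi = interval(n)
    # Half-open intervals, except that the last one in each block includes 1.
    return 1 if lo <= omega < hi or (hi == 1 and omega == 1) else 0

omega = Fraction(1, 3)
# Blocks 1 through 5 together use the indices n = 1, ..., 62.
hits = [n for n in range(1, 2**6 - 1) if z(n, omega)]
lengths = [interval(n)[1] - interval(n)[0] for n in hits]
print(hits, lengths)
```

Every block contributes exactly one index with $Z_n(\omega) = 1$, so $Z_n(\omega) = 1$ infinitely often for each fixed $\omega$, while $P(Z_n = 1)$, the interval length, shrinks to 0.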


5.3. Laws of large numbers. 
Here we prove a first form of the weak law of large numbers. 


Theorem 5.3.1. (Weak law of large numbers - first version.) Let
$X_1, X_2, \ldots$ be a sequence of independent random variables, each having the
same mean $m$, and each having variance $\le v < \infty$. Then for all $\epsilon > 0$,

$$\lim_{n\to\infty} P\Big( \Big| \frac{1}{n}(X_1 + X_2 + \ldots + X_n) - m \Big| \ge \epsilon \Big) = 0.$$

In words, the partial averages $\frac{1}{n}(X_1 + X_2 + \ldots + X_n)$ converge in probability
to $m$.


Proof. Set $S_n = \frac{1}{n}(X_1 + X_2 + \ldots + X_n)$. Then using linearity of
expected value, and also properties (4.1.5) and (4.1.6) of variance, we see
that $E(S_n) = m$ and $\mathrm{Var}(S_n) \le v/n$. Hence by Chebychev’s inequality
(Proposition 5.1.2), we have

$$P\Big( \Big| \frac{1}{n}(X_1 + X_2 + \ldots + X_n) - m \Big| \ge \epsilon \Big) \le \frac{v}{\epsilon^2 n} \to 0, \qquad n \to \infty,$$

as required. ∎


For example, in the case of infinite coin tossing, if $X_i = r_i = 0$ or 1 as
the $i$th coin is tails or heads, then Theorem 5.3.1 states that the probability
that the fraction of heads on the first $n$ tosses differs from $\frac{1}{2}$ by more than $\epsilon$
goes to 0 as $n \to \infty$. Informally, the fraction of heads gets closer and closer
to $\frac{1}{2}$ with higher and higher probability.
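
For fair coin tossing the probability in the weak law can be computed exactly from binomial coefficients and compared with the Chebychev bound $v/(\epsilon^2 n)$, where $v = 1/4$ is the variance of one toss. The following sketch does this for $\epsilon = 1/10$; the function names and the chosen values of $n$ are illustrative.

```python
from fractions import Fraction
from math import comb

def deviation_prob(n, eps):
    """Exact P(|heads/n - 1/2| >= eps) for n fair coin tosses."""
    favourable = sum(comb(n, k) for k in range(n + 1)
                     if abs(Fraction(k, n) - Fraction(1, 2)) >= eps)
    return Fraction(favourable, 2**n)

def chebychev_bound(n, eps):
    """v / (eps^2 n) with v = 1/4, the variance of a single fair toss."""
    return Fraction(1, 4) / (eps**2 * n)

eps = Fraction(1, 10)
for n in (10, 100, 1000):
    print(n, float(deviation_prob(n, eps)), float(chebychev_bound(n, eps)))
```

The exact probabilities fall far faster than the bound, which is to be expected: Chebychev's inequality uses only the variance, so it guarantees convergence without describing its true rate.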


We next prove a first form of the strong law of large numbers.


Theorem 5.3.2. (Strong law of large numbers - first version.) Let
$X_1, X_2, \ldots$ be a sequence of independent random variables, each having the
same finite mean $m$, and each having $E((X_i - m)^4) \le a < \infty$. Then

$$P\Big( \lim_{n\to\infty} \frac{1}{n}(X_1 + X_2 + \ldots + X_n) = m \Big) = 1.$$

In words, the partial averages $\frac{1}{n}(X_1 + X_2 + \ldots + X_n)$ converge almost surely
to $m$.

Proof. Since $(X_i - m)^2 \le (X_i - m)^4 + 1$ (consider separately the two cases
$(X_i - m)^2 \le 1$ and $(X_i - m)^2 > 1$), it follows that each $X_i$ has variance
$\le a + 1 \equiv v < \infty$. We assume for simplicity that $m = 0$; if not then we can
simply replace $X_i$ by $X_i - m$.

Let $S_n = X_1 + X_2 + \ldots + X_n$, and consider $E(S_n^4)$. Now, $S_n^4$ will (when
multiplied out) contain many terms of the form $X_i X_j X_k X_\ell$ for $i, j, k, \ell$
distinct, but all of these have expected value zero. Similarly, it will contain
many terms of the form $X_i X_j (X_k)^2$ and $X_i (X_j)^3$ which also have expected
value zero. The only terms with non-vanishing expectation will be $n$ terms
of the form $(X_i)^4$, and $\binom{n}{2}\binom{4}{2} = 3n(n-1)$ terms of the form $(X_i)^2 (X_j)^2$
with $i \ne j$. Now, $E((X_i)^4) \le a$. Furthermore, if $i \ne j$ then $X_i^2$ and $X_j^2$ are
independent by Proposition 3.2.3, so since $m = 0$ we have
$E((X_i)^2 (X_j)^2) = E((X_i)^2)\, E((X_j)^2) = \mathrm{Var}(X_i)\, \mathrm{Var}(X_j) \le v^2$. We conclude that
$E(S_n^4) \le na + 3n(n-1)v^2 \le Kn^2$ where $K = a + 3v^2$. This is the key.

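To finish, we note that for any $\epsilon > 0$, we have by Markov’s inequality
that

$$P\Big( \Big| \frac{1}{n} S_n \Big| \ge \epsilon \Big) = P(|S_n| \ge n\epsilon) = P\big( |S_n|^4 \ge n^4 \epsilon^4 \big) \le \frac{E(S_n^4)}{n^4 \epsilon^4} \le \frac{K n^2}{n^4 \epsilon^4} = K \epsilon^{-4} \frac{1}{n^2}.$$

Since $\sum_{n=1}^{\infty} \frac{1}{n^2} < \infty$ by (A.3.7), it follows from Corollary 5.2.2 that $\frac{1}{n} S_n$
converges to 0 almost surely. ∎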
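
The term count in this proof can be verified by brute force in the simplest case. The sketch below assumes independent fair signs $X_i = \pm 1$ (an illustrative choice, for which $E(X_i^2) = E(X_i^4) = 1$) and enumerates all $2^n$ outcomes.

```python
from fractions import Fraction
from itertools import product

def fourth_moment(n):
    """E(S_n^4) for S_n = X_1 + ... + X_n with independent fair signs X_i = +1 or -1."""
    total = sum(sum(signs) ** 4 for signs in product((-1, 1), repeat=n))
    return Fraction(total, 2**n)

def predicted(n):
    """n * E(X^4) + 3 n (n - 1) * E(X^2)^2, with E(X^4) = E(X^2) = 1 here."""
    return n + 3 * n * (n - 1)

for n in range(1, 9):
    print(n, fourth_moment(n), predicted(n))
```

In this case the bound $na + 3n(n-1)v^2$ holds with equality, which shows that the $n^2$ growth of $E(S_n^4)$ used in the proof cannot be improved in general.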


For example, in the case of coin tossing, Theorem 5.3.2 states that the
fraction of heads on the first $n$ tosses will converge, as $n \to \infty$, to $\frac{1}{2}$.
Although this conclusion sounds quite similar to the corresponding conclusion
from Theorem 5.3.1, we know from Proposition 5.2.3 (and the examples
following) that it is actually a stronger result.


5.4. Eliminating the moment conditions. 


Theorems 5.3.1 and 5.3.2 provide clear evidence that the partial averages
$\frac{1}{n}(X_1 + \ldots + X_n)$ are indeed converging to the common mean $m$ in some
sense. However, they require that the variance (i.e., second moment) or
even the fourth moment of the $X_i$ be finite (and uniformly bounded). This
is an unnatural condition which is sometimes difficult to check.

Thus, in this section we develop a new form of the strong law of large 
numbers which requires only that the mean (i.e., first moment) of the ran- 
dom variables be finite. However, as a penalty, it demands that the random 
variables be “i.i.d.” as opposed to merely independent. (Of course, once we 
have proven a strong law, we will immediately obtain a weak law by using 
Proposition 5.2.3.) Our proof follows the approach in Billingsley (1995).
We begin with a definition. 


Definition 5.4.1. A collection of random variables $\{X_\alpha\}_{\alpha \in I}$ are
identically distributed if for any Borel-measurable function $f : \mathbf{R} \to \mathbf{R}$, the
expected value $E(f(X_\alpha))$ does not depend on $\alpha$, i.e. is the same for all
$\alpha \in I$.


Remark 5.4.2. It follows from Proposition 6.0.2 and Corollary 6.1.3
below that $\{X_\alpha\}_{\alpha \in I}$ are identically distributed if and only if for all $x \in \mathbf{R}$,
the probability $P(X_\alpha \le x)$ does not depend on $\alpha$.


Definition 5.4.3. A collection of random variables $\{X_\alpha\}_{\alpha \in I}$ are i.i.d. if
they are independent and are also identically distributed.


Theorem 5.4.4. (Strong law of large numbers - second version.) Let
$X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, each having finite mean
$m$. Then

$$P\Big( \lim_{n\to\infty} \frac{1}{n}(X_1 + X_2 + \ldots + X_n) = m \Big) = 1.$$

In words, the partial averages $\frac{1}{n}(X_1 + X_2 + \ldots + X_n)$ converge almost surely
to $m$.


Proof. The proof is somewhat difficult because we do not assume the 
finiteness of any moments of the X; other than the first. Instead, we shall 
use a “truncation argument”, by defining new random variables Y; which 
are truncated versions of the corresponding X;. Thus, higher moments will 
exist for the Y;, even though the Y; will tend to be “similar” to the X;. 

To begin, we assume that $X_i \ge 0$; if not, we can consider separately $X_i^+$
and $X_i^-$. (Note that we cannot now assume without loss of generality that
$m = 0$.) We set $Y_i = X_i \mathbf{1}_{X_i \le i}$, i.e. $Y_i = X_i$ unless $X_i$ exceeds $i$, in which
case $Y_i = 0$. Then since $0 \le Y_i \le i$, therefore $E(Y_i^k) \le i^k < \infty$. Also, by
Proposition 3.2.3, the $\{Y_i\}$ are independent. Furthermore, since the $X_i$ are
i.i.d., we have that $E(Y_i) = E(X_i \mathbf{1}_{X_i \le i}) = E(X_1 \mathbf{1}_{X_1 \le i}) \nearrow E(X_1) = m$ as
$i \to \infty$, by the monotone convergence theorem.

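We set $S_n = \sum_{i=1}^{n} X_i$, and set $S_n^* = \sum_{i=1}^{n} Y_i$. We compute using (4.1.6)
and (4.1.4) that

$$\mathrm{Var}(S_n^*) = \mathrm{Var}(Y_1) + \ldots + \mathrm{Var}(Y_n) \le E(Y_1^2) + \ldots + E(Y_n^2)$$

$$= E(X_1^2 \mathbf{1}_{X_1 \le 1}) + \ldots + E(X_1^2 \mathbf{1}_{X_1 \le n}) \le n E(X_1^2 \mathbf{1}_{X_1 \le n}) \le n^3 < \infty. \qquad (5.4.5)$$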


We now choose $\alpha > 1$ (we will later let $\alpha \searrow 1$), and set $u_n = \lfloor \alpha^n \rfloor$, i.e.
$u_n$ is the greatest integer not exceeding $\alpha^n$. Then $u_n \le \alpha^n$. Furthermore,
since $\alpha^n \ge 1$, it follows (consider separately the cases $\alpha^n \le 2$ and $\alpha^n > 2$)
that $u_n \ge \alpha^n / 2$, i.e. $1/u_n \le 2/\alpha^n$. Hence, for any $x > 0$, we have that

$$\sum_{u_n \ge x} 1/u_n \le \sum_{\alpha^n \ge x} 1/u_n \le \sum_{\alpha^n \ge x} 2/\alpha^n \le \sum_{k = \log_\alpha x}^{\infty} 2/\alpha^k = \frac{2/x}{1 - \frac{1}{\alpha}}. \qquad (5.4.6)$$

(Note that here we sum over $k = \log_\alpha x,\ \log_\alpha x + 1, \ldots$, even if $\log_\alpha x$ is not
an integer.)
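
The geometric bound (5.4.6) can be checked numerically. The sketch below assumes the bound has the form $(2/x)/(1 - 1/\alpha)$ obtained by summing the geometric series, uses the illustrative value $\alpha = 3/2$, and truncates the infinite sum at a hypothetical cutoff `n_max`; since every term is positive, the truncated sum can only underestimate the left-hand side.

```python
from fractions import Fraction
from math import floor

def truncated_sum(alpha, x, n_max=200):
    """Partial sum of 1/u_n over those n <= n_max with u_n = floor(alpha^n) >= x."""
    total = Fraction(0)
    for n in range(1, n_max + 1):
        u_n = floor(alpha**n)
        if u_n >= x:
            total += Fraction(1, u_n)
    return total

def bound(alpha, x):
    """Right-hand side of (5.4.6): (2/x) / (1 - 1/alpha)."""
    return (Fraction(2) / x) / (1 - 1 / alpha)

alpha = Fraction(3, 2)
for x in (1, Fraction(7, 2), 10, 100):
    print(x, float(truncated_sum(alpha, x)), float(bound(alpha, x)))
```

The important feature is that the bound scales like $1/x$: this is what later cancels one power of $X_1$ and leaves a first moment.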
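We now proceed to the heart of the proof. For any $\epsilon > 0$, we have that

$$\begin{aligned}
\sum_{n=1}^{\infty} P\Big( \frac{|S^*_{u_n} - E(S^*_{u_n})|}{u_n} \ge \epsilon \Big)
&= \sum_{n=1}^{\infty} P\Big( \Big| \frac{S^*_{u_n}}{u_n} - E\Big( \frac{S^*_{u_n}}{u_n} \Big) \Big| \ge \epsilon \Big) \\
&\le \sum_{n=1}^{\infty} \frac{\mathrm{Var}(S^*_{u_n} / u_n)}{\epsilon^2} && \text{by Chebychev's inequality} \\
&= \sum_{n=1}^{\infty} \frac{\mathrm{Var}(S^*_{u_n})}{u_n^2\, \epsilon^2} && \text{by (4.1.5)} \\
&\le \sum_{n=1}^{\infty} \frac{u_n\, E(X_1^2 \mathbf{1}_{X_1 \le u_n})}{u_n^2\, \epsilon^2} && \text{by (5.4.5)} \\
&= \frac{1}{\epsilon^2}\, E\Big( X_1^2 \sum_{n=1}^{\infty} \frac{1}{u_n} \mathbf{1}_{u_n \ge X_1} \Big) && \text{by countable linearity} \\
&\le \frac{1}{\epsilon^2}\, E\Big( X_1^2\, \frac{2/X_1}{1 - \frac{1}{\alpha}} \Big) && \text{by (5.4.6)} \\
&= \frac{2}{\epsilon^2 (1 - \frac{1}{\alpha})}\, E(X_1) < \infty.
\end{aligned}$$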


This finiteness is the key. It now follows from Corollary 5.2.2 that

$$\Big\{ \frac{S^*_{u_n} - E(S^*_{u_n})}{u_n} \Big\} \text{ converges to 0 almost surely.} \qquad (5.4.7)$$


To complete the proof, we need to replace $E(S^*_{u_n}) / u_n$ by $m$, and replace
$S^*_{u_n}$ by $S_{u_n}$, and finally replace the index $u_n$ by the general index $k$. We
consider each of these three issues in turn.

First, since $E(Y_i) \to m$ as $i \to \infty$, and since $u_n \to \infty$ as $n \to \infty$, it
follows immediately that $E(S^*_{u_n}) / u_n \to m$ as $n \to \infty$. Hence, it follows from
(5.4.7) that

$$\Big\{ \frac{S^*_{u_n}}{u_n} \Big\} \text{ converges to } m \text{ almost surely.} \qquad (5.4.8)$$


Second, we note by Proposition 4.2.9 that

$$\sum_{k=1}^{\infty} P(X_k \ne Y_k) = \sum_{k=1}^{\infty} P(X_k > k) \le \sum_{k=1}^{\infty} P(X_k \ge k)$$

$$= \sum_{k=1}^{\infty} P(X_1 \ge k) = E\lfloor X_1 \rfloor \le E(X_1) = m < \infty.$$

Hence, again by Borel-Cantelli, we have that $P(X_k \ne Y_k \text{ i.o.}) = 0$, so that
$P(X_k = Y_k \text{ a.a.}) = 1$. It follows that, with probability one, as $n \to \infty$ the
limit of $S^*_{u_n} / u_n$ coincides with the limit of $S_{u_n} / u_n$. Hence, (5.4.8) implies that

$$\Big\{ \frac{S_{u_n}}{u_n} \Big\} \text{ converges to } m \text{ almost surely.} \qquad (5.4.9)$$


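Finally, for an arbitrary index $k$, we can find $n = n_k$ so that $u_n \le k < u_{n+1}$.
But then

$$\frac{u_n}{u_{n+1}} \frac{S_{u_n}}{u_n} = \frac{S_{u_n}}{u_{n+1}} \le \frac{S_k}{k} \le \frac{S_{u_{n+1}}}{u_n} = \frac{S_{u_{n+1}}}{u_{n+1}} \frac{u_{n+1}}{u_n}. \qquad (5.4.10)$$

Now, as $k \to \infty$, we have $n = n_k \to \infty$, so that $\frac{u_n}{u_{n+1}} \to \frac{1}{\alpha}$ and $\frac{u_{n+1}}{u_n} \to \alpha$.
Hence, for any $\alpha > 1$ and $\delta > 0$, with probability 1 we have
$m / ((1+\delta)\alpha) \le S_k / k \le (1+\delta)\alpha m$ for all sufficiently large $k$. For any $\epsilon > 0$, choosing $\alpha > 1$
and $\delta > 0$ so that $m / ((1+\delta)\alpha) > m - \epsilon$ and $(1+\delta)\alpha m < m + \epsilon$, this implies
that $P(|(S_k / k) - m| \ge \epsilon \text{ i.o.}) = 0$. Hence, by Lemma 5.2.1, we have that as
$k \to \infty$,

$$\frac{S_k}{k} \text{ converges to } m \text{ almost surely,}$$

as required. ∎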


Using Proposition 5.2.3, we immediately obtain a corresponding state- 
ment about convergence in probability. 


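Corollary 5.4.11. (Weak law of large numbers - second version.) Let
$X_1, X_2, \ldots$ be a sequence of i.i.d. random variables, each having finite mean
$m$. Then for all $\epsilon > 0$,

$$\lim_{n\to\infty} P\Big( \Big| \frac{1}{n}(X_1 + X_2 + \ldots + X_n) - m \Big| \ge \epsilon \Big) = 0.$$

In words, the partial averages $\frac{1}{n}(X_1 + X_2 + \ldots + X_n)$ converge in probability
to $m$.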
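
A simulation can illustrate what is gained by removing the moment conditions. The sketch below draws i.i.d. samples from a Pareto distribution with shape 3 (an assumption chosen for illustration): its mean is $3/2$, but its fourth moment is infinite, so Theorem 5.3.2 does not apply while Theorem 5.4.4 does. The seed and sample sizes are arbitrary, and a simulation of course illustrates the theorem rather than proving anything.

```python
import random

def running_average(n, shape=3.0, seed=0):
    """Average of n i.i.d. Pareto(shape) draws; the mean is shape / (shape - 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        total += rng.paretovariate(shape)
    return total / n

for n in (100, 10_000, 200_000):
    print(n, running_average(n))
```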

5.5. Exercises.


Exercise 5.5.1. Suppose $E(2^X) = 4$. Prove that $P(X \ge 3) \le 1/2$.


Exercise 5.5.2. Give an example of a random variable $X$ and $\alpha > 0$ such
that $P(X \ge \alpha) > E(X)/\alpha$. [Hint: Obviously $X$ cannot be non-negative.]
Where does the proof of Markov’s inequality break down in this case?


Exercise 5.5.3. Give examples of random variables $Y$ with mean 0 and
variance 1 such that

(a) $P(|Y| \ge 2) = 1/4$.

(b) $P(|Y| \ge 2) < 1/4$.


Exercise 5.5.4. Suppose $X$ is a non-negative random variable with
$E(X) = \infty$. What does Markov’s inequality say in this case?


Exercise 5.5.5. Suppose $Y$ is a random variable with finite mean $\mu_Y$
and with $\mathrm{Var}(Y) = \infty$. What does Chebychev’s inequality say in this case?


Exercise 5.5.6. For general jointly defined random variables $X$ and
$Y$, prove that $|\mathrm{Corr}(X, Y)| \le 1$. [Hint: Don’t forget the Cauchy-Schwarz
inequality.] (Compare Exercise 4.5.11.)


Exercise 5.5.7. Let $a \in \mathbf{R}$, and let $\phi(x) = \max(x, a)$ as in Exercise 4.5.2.
Prove that $\phi$ is a convex function. Relate this to Jensen’s inequality and to
Exercise 4.5.2.


Exercise 5.5.8. Let $\phi(x) = x^2$.

(a) Prove that $\phi$ is a convex function.

(b) What does Jensen’s inequality say for this choice of $\phi$?

(c) Where in the text have we already seen the result of part (b)?


Exercise 5.5.9. Prove Cantelli’s inequality, which states that if $X$ is a
random variable with finite mean $m$ and finite variance $v$, then for $\alpha \ge 0$,

$$P(X - m \ge \alpha) \le \frac{v}{v + \alpha^2}.$$

[Hint: First show $P(X - m \ge \alpha) \le P((X - m + y)^2 \ge (\alpha + y)^2)$. Then
use Markov’s inequality, and minimise the resulting bound over choice of
$y \ge 0$.]


Exercise 5.5.10. Let $X_1, X_2, \ldots$ be a sequence of random variables, with
$E[X_n] = 8$ and $\mathrm{Var}[X_n] = 1/\sqrt{n}$ for each $n$. Prove or disprove that $\{X_n\}$
must converge to 8 in probability.


Exercise 5.5.11. Give (with proof) an example of a sequence $\{Y_n\}$ of
jointly-defined random variables, such that as $n \to \infty$: (i) $Y_n / n$ converges
to 0 in probability; and (ii) $Y_n / n^2$ converges to 0 with probability 1; but
(iii) $Y_n / n$ does not converge to 0 with probability 1.


Exercise 5.5.12. Give (with proof) an example of two discrete random 
variables having the same mean and the same variance, but which are not 
identically distributed. 


Exercise 5.5.13. Let $r \in \mathbf{N}$. Let $X_1, X_2, \ldots$ be identically distributed
random variables having finite mean $m$, which are $r$-dependent, i.e. such
that $X_{k_1}, X_{k_2}, \ldots, X_{k_j}$ are independent whenever $k_{i+1} > k_i + r$ for each $i$.
(Thus, independent random variables are 0-dependent.) Prove that with
probability one, $\frac{1}{n} \sum_{i=1}^{n} X_i \to m$ as $n \to \infty$. [Hint: Break up the sum
$\sum_{i=1}^{n} X_i$ into $r$ different sums.]


Exercise 5.5.14. Prove the converse of Lemma 5.2.1. That is, prove
that if $\{X_n\}$ converges to $X$ almost surely, then for each $\epsilon > 0$ we have
$P(|X_n - X| \ge \epsilon \text{ i.o.}) = 0$.


Exercise 5.5.15. Let $X_1, X_2, \ldots$ be a sequence of independent random
variables with $P(X_n = 3^n) = P(X_n = -3^n) = \frac{1}{2}$. Let $S_n = X_1 + \ldots + X_n$.

(a) Compute $E(X_n)$ for each $n$.

(b) For $n \in \mathbf{N}$, compute $R_n = \sup\{r \in \mathbf{R} ;\ P(|S_n| \ge r) = 1\}$, i.e. the
largest number such that $|S_n|$ is always at least $R_n$.

(c) Compute $\lim_{n\to\infty} \frac{1}{n} R_n$.

(d) For which $\epsilon > 0$ (if any) is it the case that $P(\frac{1}{n} |S_n| \ge \epsilon) \not\to 0$?

(e) Why does this result not contradict the various laws of large numbers?


5.6. Section summary. 


This section presented inequalities about random variables. The first,
Markov’s inequality, provides an upper bound on $P(X \ge \alpha)$ for
non-negative random variables $X$. The second, Chebychev’s inequality, provides
an upper bound on $P(|Y - \mu_Y| \ge \alpha)$ in terms of the variance of $Y$. The
Cauchy-Schwarz and Jensen inequalities were also discussed.

It then discussed convergence of sequences of random variables, and
presented various versions of the Law of Large Numbers. This law concerns
partial averages $\frac{1}{n}(X_1 + \ldots + X_n)$ of collections of independent random
variables $\{X_n\}$ all having the same mean $m$. Under the assumption of either
finite higher moments (First Version) or identical distributions (Second
Version), we proved that these partial averages converge in probability (Weak
Law) or almost surely (Strong Law) to the mean $m$.

6. Distributions of random variables.


The distribution or law of a random variable is defined as follows.


Definition 6.0.1. Given a random variable $X$ on a probability triple
$(\Omega, \mathcal{F}, P)$, its distribution (or law) is the function $\mu$ defined on $\mathcal{B}$, the Borel
subsets of $\mathbf{R}$, by

$$\mu(B) = P(X \in B) = P(X^{-1}(B)), \qquad B \in \mathcal{B}.$$

If $\mu$ is the law of a random variable, then $(\mathbf{R}, \mathcal{B}, \mu)$ is a valid probability
triple. We shall sometimes write $\mu$ as $\mathcal{L}(X)$ or as $P X^{-1}$. We shall also
write $X \sim \mu$ to indicate that $\mu$ is the distribution of $X$.

We define the cumulative distribution function of a random variable $X$
by $F_X(x) = P(X \le x)$, for $x \in \mathbf{R}$. By continuity of probabilities, the
function $F_X$ is right-continuous, i.e. if $\{x_n\} \searrow x$ then $F_X(x_n) \to F_X(x)$. It
is also clearly a non-decreasing function of $x$, with $\lim_{x \to \infty} F_X(x) = 1$ and
$\lim_{x \to -\infty} F_X(x) = 0$. We note the following.


Proposition 6.0.2. Let $X$ and $Y$ be two random variables (possibly
defined on different probability triples). Then $\mathcal{L}(X) = \mathcal{L}(Y)$ if and only if
$F_X(x) = F_Y(x)$ for all $x \in \mathbf{R}$.


Proof. The “if” part follows from Corollary 2.5.9. The “only if” part is
immediate upon setting $B = (-\infty, x]$. ∎


6.1. Change of variable theorem. 


The following result shows that distributions specify completely the ex- 
pected values of random variables (and functions of them). 


Theorem 6.1.1. (Change of variable theorem.) Given a probability
triple $(\Omega, \mathcal{F}, \mathbf{P})$, let X be a random variable having distribution $\mu$. Then
for any Borel-measurable function $f : \mathbf{R} \to \mathbf{R}$, we have

$$\int_\Omega f(X(\omega)) \, \mathbf{P}(d\omega) = \int_{-\infty}^{\infty} f(t) \, \mu(dt), \qquad (6.1.2)$$

i.e. $\mathbf{E}_{\mathbf{P}}[f(X)] = \mathbf{E}_\mu(f)$, provided that either side is well-defined. In words,
the expected value of the random variable f(X) with respect to the probability
measure $\mathbf{P}$ on $\Omega$ is equal to the expected value of the function f with
respect to the measure $\mu$ on $\mathbf{R}$.
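

Proof. Suppose first that $f = \mathbf{1}_B$ is an indicator function of a Borel
set $B \subseteq \mathbf{R}$. Then $\int_\Omega f(X(\omega)) \, \mathbf{P}(d\omega) = \int_\Omega \mathbf{1}_{\{X(\omega) \in B\}} \, \mathbf{P}(d\omega) = \mathbf{P}(X \in B)$,
while $\int_{-\infty}^{\infty} f(t) \, \mu(dt) = \int_{-\infty}^{\infty} \mathbf{1}_{\{t \in B\}} \, \mu(dt) = \mu(B) = \mathbf{P}(X \in B)$, so equality
holds in this case.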

Now suppose that f is a non-negative simple function. Then f is a finite 
positive linear combination of indicator functions. But since both sides of 
(6.1.2) are linear functions of f, we see that equality still holds in this case. 

Next suppose that f is a general non-negative Borel-measurable function.
Then by Proposition 4.2.5, we can find a sequence $\{f_n\}$ of non-negative
simple functions such that $\{f_n\} \nearrow f$. We know that (6.1.2) holds when f
is replaced by $f_n$. But then by letting $n \to \infty$ and using the Monotone
Convergence Theorem (Theorem 4.2.2), we see that (6.1.2) holds for f as
well.

Finally, for general Borel-measurable f, we can write $f = f^+ - f^-$. Since
(6.1.2) holds for $f^+$ and for $f^-$ separately, and since it is linear, therefore
it must also hold for f. ∎


Remark. The method of proof used in Theorem 6.1.1 (namely con- 
sidering first indicator functions, then non-negative simple functions, then 
general non-negative functions, and finally general functions) is quite widely 
applicable; we shall use it again in the next subsection. 


Corollary 6.1.3. Let X and Y be two random variables (possibly defined
on different probability triples). Then $\mathcal{L}(X) = \mathcal{L}(Y)$ if and only if
$\mathbf{E}[f(X)] = \mathbf{E}[f(Y)]$ for all Borel-measurable $f : \mathbf{R} \to \mathbf{R}$ for which either
expectation is well-defined. (Compare Proposition 6.0.2 and Remark 5.4.2.)


Proof. If $\mathcal{L}(X) = \mathcal{L}(Y) = \mu$ (say), then Theorem 6.1.1 says that
$\mathbf{E}[f(X)] = \mathbf{E}[f(Y)] = \int_{\mathbf{R}} f \, d\mu$.

Conversely, if $\mathbf{E}[f(X)] = \mathbf{E}[f(Y)]$ for all Borel-measurable $f : \mathbf{R} \to \mathbf{R}$,
then setting $f = \mathbf{1}_B$ shows that $\mathbf{P}[X \in B] = \mathbf{P}[Y \in B]$ for all Borel $B \subseteq \mathbf{R}$,
i.e. that $\mathcal{L}(X) = \mathcal{L}(Y)$. ∎


Corollary 6.1.4. If X and Y are random variables with $\mathbf{P}(X = Y) = 1$,
then $\mathbf{E}[f(X)] = \mathbf{E}[f(Y)]$ for all Borel-measurable $f : \mathbf{R} \to \mathbf{R}$ for which
either expectation is well-defined.


Proof. It follows directly that $\mathcal{L}(X) = \mathcal{L}(Y)$. Then, letting $\mu = \mathcal{L}(X) = \mathcal{L}(Y)$,
we have from Theorem 6.1.1 that $\mathbf{E}[f(X)] = \mathbf{E}[f(Y)] = \int_{\mathbf{R}} f \, d\mu$. ∎


6.2. Examples of distributions.


For a first example of a distribution of a random variable, suppose that
$\mathbf{P}(X = c) = 1$, i.e. that X is always (or, at least, with probability 1)
equal to some constant real number c. Then the distribution of X is the
point mass $\delta_c$, defined by $\delta_c(B) = \mathbf{1}_B(c)$, i.e. $\delta_c(B)$ equals 1 if $c \in B$ and
equals 0 otherwise. In this case we write $X \sim \delta_c$, or $\mathcal{L}(X) = \delta_c$. From
Corollary 6.1.4, since $\mathbf{P}(X = c) = 1$, we have $\mathbf{E}(X) = \mathbf{E}(c) = c$, and
$\mathbf{E}(X^2 + 2) = \mathbf{E}(c^2 + 2) = c^2 + 2$, and more generally $\mathbf{E}[f(X)] = f(c)$ for any
function f. In symbols, $\int_\Omega f(X(\omega)) \, \mathbf{P}(d\omega) = \int_{\mathbf{R}} f(t) \, \delta_c(dt) = f(c)$. That is,
the mapping $f \mapsto \mathbf{E}[f(X)]$ is an evaluation map.

For a second example, suppose X has the Poisson(5) distribution considered
earlier. Then $\mathbf{P}(X \in A) = \sum_{j \in A} e^{-5} 5^j / j!$, which implies that
$\mathcal{L}(X) = \sum_{j=0}^{\infty} (e^{-5} 5^j / j!) \, \delta_j$, a convex combination of point masses. The
following proposition shows that we then have $\mathbf{E}(f(X)) = \sum_{j=0}^{\infty} f(j) \, e^{-5} 5^j / j!$
for any function $f : \mathbf{R} \to \mathbf{R}$.


Proposition 6.2.1. Suppose $\mu = \sum_i \beta_i \mu_i$, where $\{\mu_i\}$ are probability
distributions, and $\{\beta_i\}$ are non-negative constants (summing to 1, if we
want $\mu$ to also be a probability distribution). Then for Borel-measurable
functions $f : \mathbf{R} \to \mathbf{R}$,

$$\int f \, d\mu = \sum_i \beta_i \int f \, d\mu_i,$$

provided either side is well-defined.


Proof. As in the proof of Theorem 6.1.1, it suffices (by linearity and
the monotone convergence theorem) to check the equation when $f = \mathbf{1}_B$ is
an indicator function of a Borel set B. But in this case the result follows
immediately since $\mu(B) = \sum_i \beta_i \mu_i(B)$. ∎


Clearly, any other discrete random variable can be handled similarly to 
the Poisson(5) example. Thus, discrete random variables do not present 
any substantial new technical issues. 
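
The series formula for a discrete law can be sketched numerically. The following is a minimal illustration, assuming the Poisson(5) law above; the function name `poisson_expectation` and the truncation level `terms` are choices made here for illustration, not part of the text.

```python
import math

def poisson_expectation(f, lam=5.0, terms=200):
    """Approximate E[f(X)] = sum_j f(j) e^{-lam} lam^j / j! by a truncated series."""
    total = 0.0
    weight = math.exp(-lam)        # P(X = 0)
    for j in range(terms):
        total += f(j) * weight
        weight *= lam / (j + 1)    # turns P(X = j) into P(X = j + 1)
    return total

mean = poisson_expectation(lambda j: j)           # E(X)
second = poisson_expectation(lambda j: j * j)     # E(X^2)
total_mass = poisson_expectation(lambda j: 1.0)   # should be close to 1
```

Updating the weights recursively avoids computing large factorials directly; the truncation is harmless here only because the Poisson weights decay very quickly and f grows slowly.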

For a third example of a distribution of a random variable, suppose X
has the Normal(0,1) distribution considered earlier (henceforth denoted
N(0,1)). We can define its law $\mu_N$ by

$$\mu_N(B) = \int_{-\infty}^{\infty} \phi(t) \, \mathbf{1}_B(t) \, \lambda(dt), \qquad B \text{ Borel}, \qquad (6.2.2)$$

where $\lambda$ is Lebesgue measure on $\mathbf{R}$ (cf. (4.4.2)) and where $\phi(t) = \frac{1}{\sqrt{2\pi}} e^{-t^2/2}$.
We note that for a mathematically complete definition, it is necessary to
use the Lebesgue integral rather than the Riemann integral. Indeed, the
Riemann integral is undefined unless B is a rather simple set (e.g. a finite
union of intervals, or more generally a set whose boundary has measure
0), while we need $\mu_N(B)$ to be defined for all Borel sets B. Furthermore,
since $\phi$ is continuous it is Borel-measurable (Proposition 3.1.8), so Lebesgue
integrals such as (6.2.2) make sense.

Similarly, given any Borel-measurable function (called a density function)
f such that $f \ge 0$ and $\int_{-\infty}^{\infty} f(t) \, \lambda(dt) = 1$, we can define a law $\mu$
by

$$\mu(B) = \int_{-\infty}^{\infty} f(t) \, \mathbf{1}_B(t) \, \lambda(dt), \qquad B \text{ Borel}.$$

We shall sometimes write this as $\mu(B) = \int_B f$ or $\mu(B) = \int_B f(t) \, \lambda(dt)$, or
even as $\mu(dt) = f(t) \, \lambda(dt)$ (where such equalities of “differentials” have the
interpretation that the two sides are equal when integrated over $t \in B$ for
any Borel B, i.e. that $\int_B \mu(dt) = \int_B f(t) \, \lambda(dt)$ for all B). We shall also write
this as $\frac{d\mu}{d\lambda} = f$, and shall say that $\mu$ is absolutely continuous with respect
to $\lambda$, and that f is the density for $\mu$ with respect to $\lambda$. We then have the
following.


Proposition 6.2.3. Suppose $\mu$ has density f with respect to $\lambda$. Then
for any Borel-measurable function $g : \mathbf{R} \to \mathbf{R}$,

$$\mathbf{E}_\mu(g) = \int_{-\infty}^{\infty} g(t) \, \mu(dt) = \int_{-\infty}^{\infty} g(t) f(t) \, \lambda(dt),$$

provided either side is well-defined. In words, to compute the integral of a
function with respect to $\mu$, it suffices to compute the integral of the function
times the density with respect to $\lambda$.


Proof. Once again, it suffices to check the equation when $g = \mathbf{1}_B$ is
an indicator function of a Borel set B. But in that case, $\int g(t) \, \mu(dt) =
\int \mathbf{1}_B(t) \, \mu(dt) = \mu(B)$, while $\int g(t) f(t) \, \lambda(dt) = \int \mathbf{1}_B(t) f(t) \, \lambda(dt) = \mu(B)$ by
definition. The result follows. ∎


By combining Theorem 4.4.1 and Proposition 6.2.3, it is possible to
do explicit computations with absolutely-continuous random variables. For
example, if $X \sim N(0,1)$, then

$$\mathbf{E}(X) = \int t \, \mu_N(dt) = \int t \, \phi(t) \, \lambda(dt) = \int_{-\infty}^{\infty} t \, \phi(t) \, dt,$$

and more generally

$$\mathbf{E}(g(X)) = \int g(t) \, \mu_N(dt) = \int g(t) \, \phi(t) \, \lambda(dt) = \int_{-\infty}^{\infty} g(t) \, \phi(t) \, dt$$

for any Riemann-integrable function g; here the last expression is an ordinary,
old-fashioned, calculus-style Riemann integral. It can be computed in
this manner that $\mathbf{E}(X) = 0$, $\mathbf{E}(X^2) = 1$, $\mathbf{E}(X^4) = 3$, etc.
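
As a rough numerical companion to these Riemann integrals, the sketch below approximates the integral of g(t) φ(t) with composite Simpson's rule. The truncation to [-12, 12], the step count, and the name `normal_moment` are assumptions made for this illustration; the text itself computes these moments analytically.

```python
import math

def phi(t):
    """Standard normal density."""
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def normal_moment(g, lo=-12.0, hi=12.0, n=20000):
    """Composite Simpson approximation of the integral of g(t) * phi(t) over [lo, hi].

    n must be even; the tails outside [lo, hi] are ignored as negligible.
    """
    h = (hi - lo) / n
    total = g(lo) * phi(lo) + g(hi) * phi(hi)
    for i in range(1, n):
        t = lo + i * h
        total += (4 if i % 2 else 2) * g(t) * phi(t)
    return total * h / 3

m1 = normal_moment(lambda t: t)
m2 = normal_moment(lambda t: t ** 2)
m4 = normal_moment(lambda t: t ** 4)
```

The approximations `m1`, `m2`, `m4` can be compared with the exact values 0, 1, 3 stated above.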

For an example combining Propositions 6.2.3 and 6.2.1, suppose that
$\mathcal{L}(X) = \frac14 \delta_1 + \frac14 \delta_2 + \frac12 \mu_N$, where $\mu_N$ is again the N(0,1) distribution.
Then $\mathbf{E}(X) = \frac14(1) + \frac14(2) + \frac12(0) = \frac34$, $\mathbf{E}(X^2) = \frac14(1) + \frac14(4) + \frac12(1) = \frac74$,
and so on. Note, however, that it is not the case that Var(X) equals the
corresponding linear combination of variances (indeed, the variance of a
point-mass is 0, so that the corresponding linear combination of variances
is $\frac14(0) + \frac14(0) + \frac12(1) = \frac12$); rather, the formula $\mathrm{Var}(X) = \mathbf{E}(X^2) - \mathbf{E}(X)^2 =
\frac74 - (\frac34)^2 = \frac{19}{16}$ should be used.
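
The arithmetic of this mixture example can be checked with exact rationals. This is only a sketch, assuming the weights 1/4, 1/4, 1/2 on the point masses at 1 and 2 and on N(0,1); the tuple layout and variable names are illustrative.

```python
from fractions import Fraction as Fr

# (weight, first moment, second moment) of each component law:
# point mass at 1, point mass at 2, and N(0,1).
components = [
    (Fr(1, 4), Fr(1), Fr(1)),
    (Fr(1, 4), Fr(2), Fr(4)),
    (Fr(1, 2), Fr(0), Fr(1)),
]

# Moments are integrals against the mixture, so they combine linearly.
mean = sum(w * m1 for w, m1, _ in components)
second = sum(w * m2 for w, _, m2 in components)
variance = second - mean ** 2

# The naive weighted average of the component variances is a different number.
mixed_variances = sum(w * (m2 - m1 ** 2) for w, m1, m2 in components)
```

Here `variance` and `mixed_variances` differ because the variance involves the square of E(X), which is not linear in the law.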


Exercise 6.2.4. Why does Proposition 6.2.1 not imply that Var(X) 
equals the corresponding linear combination of variances? 


6.3. Exercises. 


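Exercise 6.3.1. Let $(\Omega, \mathcal{F}, \mathbf{P})$ be Lebesgue measure on [0, 1], and set

$$X(\omega) = \begin{cases} 1, & 0 \le \omega < 1/4 \\ 2\omega, & 1/4 \le \omega < 3/4 \\ \omega, & 3/4 \le \omega \le 1. \end{cases}$$

Compute $\mathbf{P}(X \in A)$ where
(a) $A = [0, 1]$.
(b) $A = [\frac12, 1]$.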


Exercise 6.3.2. Suppose $\mathbf{P}(Z = 0) = \mathbf{P}(Z = 1) = \frac12$, that $Y \sim N(0,1)$,
and that Y and Z are independent. Set X = YZ. What is the law of X?


Exercise 6.3.3. Let X ~ Poisson(5). 
(a) Compute E(X) and Var(X). 
(b) Compute $\mathbf{E}(3^X)$.


Exercise 6.3.4. Compute E(X), E(X?), and Var(X), where the law of 
X is given by 

(a) $\mathcal{L}(X) = \frac12 \delta_1 + \frac12 \lambda$, where $\lambda$ is Lebesgue measure on [0, 1].

(b) $\mathcal{L}(X) = \frac13 \delta_2 + \frac23 \mu_N$, where $\mu_N$ is the standard normal distribution
N(0,1).


Exercise 6.3.5. Let X and Z be independent, with X ~ N(0,1), and 
with P(Z = 1) = P(Z = —1) = 1/2. Let Y = XZ (i.e., Y is the product of 
X and Z). 
(a) Prove that $Y \sim N(0,1)$.

(b) Prove that $\mathbf{P}(|X| = |Y|) = 1$.

(c) Prove that X and Y are not independent. 

(d) Prove that Cov(X,Y) =0. 

(e) It is sometimes claimed that if X and Y are normally distributed ran- 
dom variables with Cov(X,Y) = 0, then X and Y must be independent. 
Is that claim correct? 


Exercise 6.3.6. Let X and Y be random variables on some probability
triple $(\Omega, \mathcal{F}, \mathbf{P})$. Suppose $\mathbf{E}(X^4) < \infty$, and that $\mathbf{P}[m \le X < z] = \mathbf{P}[m \le
Y < z]$ for all integers m and all $z \in \mathbf{R}$. Prove or disprove that we necessarily
have $\mathbf{E}(Y^4) = \mathbf{E}(X^4)$.


Exercise 6.3.7. Let X be a random variable, and let $F_X(x)$ be its cumulative
distribution function. For fixed $x \in \mathbf{R}$, we know by right-continuity
that $\lim_{y \searrow x} F_X(y) = F_X(x)$.

(a) Give a necessary and sufficient condition that $\lim_{y \nearrow x} F_X(y) = F_X(x)$.
(b) More generally, give a formula for $F_X(x) - (\lim_{y \nearrow x} F_X(y))$, in terms
of a simple property of X.


Exercise 6.3.8. Consider the statement: $f(x) = (f(x))^2$ for all $x \in \mathbf{R}$.
(a) Prove that the statement is true for all indicator functions f = lg. 
(b) Prove that the statement is not true for the identity function f(x) = z. 
(c) Why does this fact not contradict the method of proof of Theorem 6.1.1? 


6.4. Section summary. 


This section defined the distribution (or law), $\mathcal{L}(X)$, of a random variable
X, to be a corresponding distribution on the real line. It proved that
$\mathcal{L}(X)$ is completely determined by the cumulative distribution function,
$F_X(x) = \mathbf{P}(X \le x)$, of X. It proved that the expectation $\mathbf{E}(f(X))$ of any function
of X can be computed (in principle) once $\mathcal{L}(X)$ or $F_X(x)$ is known. It
then considered a number of examples of distributions of random variables, 
including discrete and continuous random variables and various combina- 
tions of them. It provided a number of results for computing expected 
values with respect to such distributions. 


7. Stochastic processes and gambling games.


Now that we have covered most of the essential foundations of rigor- 
ous probability theory, it is time to get “moving”, i.e. to consider random 
processes rather than just static random variables. 

A (discrete time) stochastic process is simply a sequence $X_0, X_1, X_2, \ldots$
of random variables defined on some fixed probability triple $(\Omega, \mathcal{F}, \mathbf{P})$. The
random variables $\{X_n\}$ are typically not independent. In this context we
often think of n as representing time; thus, $X_n$ represents the value of a
random quantity at the time n.

For a specific example, let $(r_1, r_2, \ldots)$ be the result of infinite fair coin
tossing (so that $\{r_i\}$ are independent, and each $r_i$ equals 0 or 1 with probability
$\frac12$; see Subsection 2.6), and set

$$X_0 = 0; \qquad X_n = r_1 + r_2 + \ldots + r_n, \quad n \ge 1; \qquad (7.0.1)$$

thus, $X_n$ represents the number of heads obtained up to time n. Alternatively,
we might set

$$X_0 = 0; \qquad X_n = 2(r_1 + r_2 + \ldots + r_n) - n, \quad n \ge 1; \qquad (7.0.2)$$

then $X_n$ represents the number of heads minus the number of tails obtained
up to time n. This last example suggests a gambling game: each time we
obtain a head we increase $X_n$ by 1 (i.e., we “win”), while each time we
obtain a tail we decrease $X_n$ by 1 (i.e., we “lose”).

To allow for non-fair games, we might wish to generalise (7.0.2) to

$$X_0 = 0; \qquad X_n = Z_1 + Z_2 + \ldots + Z_n, \quad n \ge 1,$$

where the $\{Z_i\}$ are assumed to be i.i.d. random variables, satisfying $\mathbf{P}(Z_i =
1) = p$ and $\mathbf{P}(Z_i = -1) = 1 - p$, for some fixed $0 < p < 1$. (If $p = \frac12$ then
this is equivalent to (7.0.2), with $Z_i = 2r_i - 1$.)

This raises an immediate issue: can we be sure that such random vari- 
ables {Z;} even exist? In fact the answer to this question is yes, as the 
following subsection shows. 


7.1. A first existence theorem. 


We here show the existence of sequences of independent random vari- 
ables having prescribed distributions. 


Theorem 7.1.1. Let $\mu_1, \mu_2, \ldots$ be any sequence of Borel probability
measures on $\mathbf{R}$. Then there exists a probability space $(\Omega, \mathcal{F}, \mathbf{P})$, and random
variables $X_1, X_2, \ldots$ defined on $(\Omega, \mathcal{F}, \mathbf{P})$, such that $\{X_n\}$ are independent,
and $\mathcal{L}(X_n) = \mu_n$.


We begin with a lemma. 


Lemma 7.1.2. Let U be a random variable whose distribution is Lebesgue
measure (i.e., the uniform distribution) on [0,1]. Let F be any cumulative
distribution function, and set $\phi(u) = \inf\{x; F(x) \ge u\}$ for $0 < u < 1$. Then
$\mathbf{P}(\phi(U) \le z) = F(z)$ for each $z \in \mathbf{R}$; in words, the cumulative distribution
function of $\phi(U)$ is F.


Proof. Since F is non-decreasing, if $F(z) < u$, then $\phi(u) \ge z$. On the
other hand, since F is right-continuous, we have that $\inf\{x; F(x) \ge u\} =
\min\{x; F(x) \ge u\}$; that is, the infimum is actually attained, so that
$F(\phi(u)) \ge u$. In particular, if $F(z) < u$ then $\phi(u) \ne z$, so in fact $\phi(u) > z$.
Conversely, if $F(z) \ge u$, then $\phi(u) \le z$ by definition of the infimum.
Hence, $\phi(u) \le z$ if and only if $u \le F(z)$.
Since $0 \le F(z) \le 1$, we obtain that $\mathbf{P}(\phi(U) \le z) = \mathbf{P}(U \le F(z)) = F(z)$. ∎
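
The map φ of Lemma 7.1.2 is easy to write down when F is the cumulative distribution function of a law on finitely many points. The sketch below assumes such a discrete law; `make_phi` and the example values are hypothetical, chosen with dyadic probabilities so that the cumulative sums are exact in floating point.

```python
import bisect
import random

def make_phi(values, probs):
    """Return phi(u) = inf{x : F(x) >= u} for the law putting mass probs[k] on values[k].

    values must be sorted in increasing order; probs must sum to 1.
    """
    cdf = []
    running = 0.0
    for p in probs:
        running += p
        cdf.append(running)

    def phi(u):
        # bisect_left finds the first index k with cdf[k] >= u.
        k = bisect.bisect_left(cdf, u)
        return values[min(k, len(values) - 1)]

    return phi

phi = make_phi([0, 1, 3], [0.25, 0.5, 0.25])
sample = phi(random.random())   # one draw, with U uniform on [0, 1)
```

Because `bisect_left` returns the first index whose cumulative probability is at least u, ties such as u = 0.25 map to the smaller value, matching the use of the minimum in the proof.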


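Proof of Theorem 7.1.1. We let $(\Omega, \mathcal{F}, \mathbf{P})$ be infinite independent fair
coin tossing, so that $r_1, r_2, \ldots$ are i.i.d. with $\mathbf{P}(r_i = 0) = \mathbf{P}(r_i = 1) = \frac12$.
Let $\{Z_{ij}\}$ be a two-dimensional array filled by these $r_i$, as follows:

$$\begin{pmatrix} Z_{11} & Z_{12} & Z_{13} & \cdots \\ Z_{21} & Z_{22} & Z_{23} & \cdots \\ Z_{31} & Z_{32} & Z_{33} & \cdots \\ Z_{41} & Z_{42} & Z_{43} & \cdots \\ \vdots & \vdots & \vdots & \end{pmatrix} = \begin{pmatrix} r_1 & r_3 & r_6 & \cdots \\ r_2 & r_5 & \cdots & \\ r_4 & r_8 & \cdots & \\ r_7 & \cdots & & \\ \vdots & & & \end{pmatrix}$$

Hence, $\{Z_{ij}\}$ are independent, with $\mathbf{P}(Z_{ij} = 0) = \mathbf{P}(Z_{ij} = 1) = \frac12$.
Then, for each $n \in \mathbf{N}$, we set $U_n = \sum_{k=1}^{\infty} Z_{nk} / 2^k$. By Corollary 3.5.3,
the $\{U_n\}$ are independent. Furthermore, by the way the $U_n$ were constructed,
we have $\mathbf{P}(j/2^m \le U_n < (j+1)/2^m) = 1/2^m$ for $m = 1, 2, \ldots$ and $j = 0, 1, \ldots, 2^m - 1$.
By additivity and continuity of probabilities, this implies that $\mathbf{P}(a \le U_n <
b) = b - a$ whenever $0 \le a < b \le 1$. Hence, by Proposition 2.5.8, each $U_n$
follows the uniform distribution (i.e. Lebesgue measure) on [0, 1].

Finally, we set $F_n(x) = \mu_n((-\infty, x])$ for $x \in \mathbf{R}$, set $\phi_n(u) = \inf\{x; u \le
F_n(x)\}$ for $0 < u < 1$, and set $X_n = \phi_n(U_n)$. Then $\{X_n\}$ are independent
by Proposition 3.2.3, and $X_n \sim \mu_n$ by Lemma 7.1.2, as required. ∎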
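
The bookkeeping in this construction can be made concrete. The following sketch assumes the anti-diagonal filling visible in the array (r_1 to Z_11, r_2 to Z_21, r_3 to Z_12, and so on) and truncates each binary expansion after finitely many bits; `coin_index` and `uniforms_from_coins` are illustrative names, and the truncation is an approximation that the proof itself does not make.

```python
import random

def coin_index(i, j):
    """1-based index m such that Z_ij = r_m, filling the array along anti-diagonals."""
    d = i + j - 1                   # which anti-diagonal (i, j) lies on
    return d * (d - 1) // 2 + j     # coins used by earlier diagonals, plus position

def uniforms_from_coins(n_vars, n_bits, rng):
    """Approximate U_1, ..., U_{n_vars} by the first n_bits terms of sum_k Z_nk / 2^k."""
    coins = [rng.randint(0, 1) for _ in range(coin_index(n_vars, n_bits))]
    return [
        sum(coins[coin_index(i, k) - 1] / 2 ** k for k in range(1, n_bits + 1))
        for i in range(1, n_vars + 1)
    ]

us = uniforms_from_coins(3, 20, random.Random(0))
```

Distinct rows draw on disjoint sets of coins, which is what makes the resulting variables independent.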


7.2. Gambling and gambler’s ruin.


By Theorem 7.1.1, for fixed $0 < p < 1$ we can find random variables
$\{Z_i\}$ which are i.i.d. with $\mathbf{P}(Z_i = 1) = p$ and $\mathbf{P}(Z_i = -1) = 1 - p = q$. We
then set $X_n = a + Z_1 + Z_2 + \ldots + Z_n$ (with $X_0 = a$) for some fixed integer a.
We shall interpret $X_n$ as a gambling player’s “fortune” (in dollars) at time
n when repeatedly making $1 bets, and shall refer to the stochastic process
$\{X_n\}$ as simple random walk. Thus, our player begins with $a, and at each
time has probability p of winning $1 and probability q = 1 - p of losing $1.

We first note the distribution of $X_n$. Indeed, clearly $\mathbf{P}(X_n = a + k) = 0$
unless $-n \le k \le n$ with n + k even. For such k, there are $\binom{n}{(n+k)/2}$ different
possible sequences $Z_1, \ldots, Z_n$ such that $X_n = a + k$, namely all sequences
consisting of $\frac{n+k}{2}$ symbols +1 and $\frac{n-k}{2}$ symbols -1. Furthermore, each such
sequence has probability $p^{(n+k)/2} q^{(n-k)/2}$. We conclude that

$$\mathbf{P}(X_n = a + k) = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} q^{(n-k)/2}, \qquad -n \le k \le n, \quad n + k \text{ even},$$

with $\mathbf{P}(X_n = a + k) = 0$ otherwise.
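
This formula can be compared with brute-force enumeration for small n. The sketch below works with the displacement k = X_n - a and exact rationals; the function names are illustrative, and the enumeration is only feasible for small n since it visits all 2^n sequences.

```python
from fractions import Fraction
from itertools import product
from math import comb

def walk_pmf(n, k, p):
    """P(X_n = a + k) from the binomial formula."""
    if abs(k) > n or (n + k) % 2:
        return Fraction(0)
    ups = (n + k) // 2               # number of +1 steps
    return comb(n, ups) * p ** ups * (1 - p) ** (n - ups)

def walk_pmf_by_enumeration(n, k, p):
    """Same probability, summing over all sequences (Z_1, ..., Z_n) with total k."""
    total = Fraction(0)
    for steps in product((1, -1), repeat=n):
        if sum(steps) == k:
            ups = steps.count(1)
            total += p ** ups * (1 - p) ** (n - ups)
    return total

p = Fraction(1, 3)
agree = all(walk_pmf(6, k, p) == walk_pmf_by_enumeration(6, k, p) for k in range(-7, 8))
```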

This is a rather “static” observation about the process $\{X_n\}$; of greater
interest are questions which depend on its time-evolution. One such question
is the gambler’s ruin problem, defined as follows. Suppose that $0 <
a < c$, and let $\tau_0 = \inf\{n \ge 0; X_n = 0\}$ and $\tau_c = \inf\{n \ge 0; X_n = c\}$ be
the first hitting time of 0 and c, respectively. (These infima are taken to
be $+\infty$ if the condition is satisfied for no n.) The gambler’s ruin question
is, what is $\mathbf{P}(\tau_c < \tau_0)$? That is, what is the probability that the player’s
fortune will reach the value c before it reaches the value 0? Informally, what
is the probability that the gambler gets rich before going broke? (Note that
$\{\tau_c < \tau_0\}$ includes the case when $\tau_0 = \infty$ while $\tau_c < \infty$; but it does not
include the case when $\tau_c = \tau_0 = \infty$.)

Solving this question is not straightforward, since there is no limit to
how long it will take until either the fortune c or the fortune 0 is reached.
However, by using the right trick, the solution presents itself. We set $s(a) =
s_{c,p}(a) = \mathbf{P}(\tau_c < \tau_0)$. Writing the dependence on a explicitly will allow us
to vary a, and to relate s(a) to s(a-1) and s(a+1). Indeed, for $1 \le a \le c-1$
we have that

$$\begin{aligned} s(a) &= \mathbf{P}(\tau_c < \tau_0) \\ &= \mathbf{P}(Z_1 = -1, \ \tau_c < \tau_0) + \mathbf{P}(Z_1 = +1, \ \tau_c < \tau_0) \qquad (7.2.1) \\ &= q \, s(a-1) + p \, s(a+1). \end{aligned}$$

That is, s(a) is a simple convex combination of s(a - 1) and s(a + 1); this
is the key. We further have by definition that s(0) = 0 and s(c) = 1.
Now, (7.2.1) gives c-1 equations, for the c-1 unknowns $s(1), \ldots, s(c-1)$.
This system of equations can then be solved in several different ways (see
Exercises 7.2.4, 7.2.5, and 7.2.6 below), to obtain that

$$s(a) = s_{c,p}(a) = \frac{1 - (q/p)^a}{1 - (q/p)^c}, \qquad p \ne \tfrac12, \qquad (7.2.2)$$

and

$$s(a) = s_{c,p}(a) = a/c, \qquad p = \tfrac12. \qquad (7.2.3)$$

(This last equation is suggestive: for a fair game ($p = \frac12$), with probability
a/c you end up with c dollars; and the product of these two quantities is
your initial fortune a. We will consider this issue again when we study
martingales; see in particular Exercises 14.4.10 and 14.4.11.)
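
As a numerical sanity check (not a substitute for the algebra asked for in the exercises below), one can confirm with exact rationals that the closed forms satisfy the recursion (7.2.1) and the boundary conditions. The function name `success_probability` and the test values c = 10, p = 2/5 are illustrative assumptions.

```python
from fractions import Fraction

def success_probability(a, c, p):
    """s_{c,p}(a) from (7.2.2) when p != 1/2, and from (7.2.3) when p = 1/2."""
    q = 1 - p
    if p == q:
        return Fraction(a, c)
    ratio = q / p
    return (1 - ratio ** a) / (1 - ratio ** c)

def satisfies_recursion(c, p):
    """Check s(a) = q s(a-1) + p s(a+1) for 1 <= a <= c-1, plus s(0) = 0 and s(c) = 1."""
    q = 1 - p
    s = [success_probability(a, c, p) for a in range(c + 1)]
    interior = all(s[a] == q * s[a - 1] + p * s[a + 1] for a in range(1, c))
    return interior and s[0] == 0 and s[c] == 1

biased_ok = satisfies_recursion(10, Fraction(2, 5))
fair_ok = satisfies_recursion(10, Fraction(1, 2))
```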


Exercise 7.2.4. Verify that (7.2.2) and (7.2.3) satisfy (7.2.1), and also 
satisfy s(0) = 0 and s(c) = 1. 


Exercise 7.2.5. Solve equation (7.2.1) by direct algebra, as follows.
(a) Show that (7.2.1) implies that for $1 \le a \le c-1$, $s(a+1) - s(a) =
\frac{q}{p}(s(a) - s(a-1))$.

(b) Show that this implies that for $0 \le a \le c-1$, $s(a+1) - s(a) =
\left(\frac{q}{p}\right)^a s(1)$.
(c) Show that this implies that for $0 \le a \le c$, $s(a) = \sum_{i=0}^{a-1} \left(\frac{q}{p}\right)^i s(1)$.
(d) Solve for s(1), and verify (7.2.2) and (7.2.3).



Exercise 7.2.6. Solve equation (7.2.1) using the theory of difference 
equations, as follows. 

(a) Show that the corresponding “characteristic equation” $t^1 = q t^0 + p t^2$
has two distinct roots $t_1$ and $t_2$ when $p \ne 1/2$, and one double root $t_3$ when
p = 1/2. Solve for $t_1$, $t_2$, and $t_3$.

(b) When $p \ne 1/2$, the theory of difference equations says that we must
have $s_{c,p}(a) = C_1 (t_1)^a + C_2 (t_2)^a$ for some constants $C_1$ and $C_2$. Assuming
this, use the boundary conditions $s_{c,p}(0) = 0$ and $s_{c,p}(c) = 1$ to solve for $C_1$
and $C_2$. Verify (7.2.2).

(c) When p = 1/2, the theory of difference equations says that we must
have $s_{c,p}(a) = C_3 (t_3)^a + C_4 \, a \, (t_3)^a$ for some constants $C_3$ and $C_4$. Assuming
this, use the boundary conditions to solve for $C_3$ and $C_4$. Verify (7.2.3).


As a specific example, suppose you start with $9,700 (i.e., a = 9700) and
your goal is to win $10,000 before going broke (i.e., c = 10000). If $p = \frac12$,
then your probability of success is a/c = 0.97, which is very high; on the
other hand, if p = 0.49, then your probability of success is given by

$$\left[ \left( \frac{0.51}{0.49} \right)^{9700} - 1 \right] \Big/ \left[ \left( \frac{0.51}{0.49} \right)^{10000} - 1 \right],$$

which is approximately $6.1 \times 10^{-6}$, or about one chance in 163,000. This
shows rather dramatically that even a small disadvantage on each bet can
lead to a very large disadvantage in the long run!

Now let $r(a) = r_{c,p}(a) = \mathbf{P}(\tau_0 < \tau_c)$ be the probability that our gambler
goes broke before reaching the desired fortune. Then clearly s(a) and r(a)
are related. Indeed, by “considering the bank’s point of view”, we see immediately
that $r_{c,p}(a) = s_{c,1-p}(c-a)$ (that is, the chance of going broke before
obtaining c dollars, when starting with a dollars and having probability p
of winning each bet, is the same as the chance of obtaining c dollars before
going broke, when starting with c - a dollars and having probability 1 - p
of winning each bet), so that

$$r_{c,p}(a) = \begin{cases} \dfrac{1 - (p/q)^{c-a}}{1 - (p/q)^{c}}, & p \ne \frac12 \\[2ex] \dfrac{c-a}{c}, & p = \frac12. \end{cases}$$

Finally, let us consider the probability that the gambler will eventually
go broke if they never stop gambling, i.e. $\mathbf{P}(\tau_0 < \infty)$ without regard to
any target fortune c. Well, if we let $H_c = \{\tau_0 < \tau_c\}$, then clearly $\{H_c\}$ is
increasing up to $\{\tau_0 < \infty\}$, as $c \to \infty$. Hence, by continuity of probabilities,

$$\mathbf{P}(\tau_0 < \infty) = \lim_{c \to \infty} \mathbf{P}(H_c) = \lim_{c \to \infty} r_{c,p}(a) = \begin{cases} 1, & p \le \frac12 \\ (q/p)^a, & p > \frac12. \end{cases} \qquad (7.2.7)$$

Thus, if $p \le \frac12$ (i.e., if the gambler has no advantage), then the gambler is
certain to eventually go broke. On the other hand, if say p = 0.6 and a = 1,
then the gambler has probability 1/3 of never going broke.
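
The limit in (7.2.7) can be watched numerically. This is a rough floating-point sketch; the function names are illustrative, and the target c = 200 is an arbitrary stand-in for "c large".

```python
def ruin_before_target(a, c, p):
    """r_{c,p}(a): probability of hitting 0 before c, starting from a."""
    q = 1 - p
    if p == q:
        return (c - a) / c
    ratio = p / q
    return (1 - ratio ** (c - a)) / (1 - ratio ** c)

def eventual_ruin(a, p):
    """Limit of r_{c,p}(a) as c grows, as in (7.2.7)."""
    q = 1 - p
    return 1.0 if p <= q else (q / p) ** a

favourable = ruin_before_target(1, 200, 0.6)     # compare with eventual_ruin(1, 0.6)
unfavourable = ruin_before_target(5, 200, 0.4)   # compare with eventual_ruin(5, 0.4)
```

For p = 0.6 and a = 1 the limit is 2/3, consistent with the probability 1/3 of never going broke mentioned above.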


7.3. Gambling policies. 


Suppose now that our gambler is allowed to choose how much to bet
each time. That is, the gambler can choose random variables $W_n$ so that
their fortune at time n is given by

$$X_n = a + W_1 Z_1 + W_2 Z_2 + \ldots + W_n Z_n,$$

with $\{Z_n\}$ as before. To avoid “cheating”, we shall insist that $W_n \ge 0$ (you
can’t bet a negative amount), and also that $W_n = f_n(Z_1, Z_2, \ldots, Z_{n-1})$ is a
function only of the previous bet results (you can’t know the result of a bet
before you choose how much to wager). Here the $f_n : \{-1, 1\}^{n-1} \to \mathbf{R}^{\ge 0}$ are
fixed deterministic functions, collectively referred to as the gambling policy.
(If $W_n = 1$ for each n, then this is equivalent to our previous gambling
model.)

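Note that $X_n = X_{n-1} + W_n Z_n$, with $W_n$ and $Z_n$ independent, and with
$\mathbf{E}(Z_n) = p - q$. Hence, $\mathbf{E}(X_n) = \mathbf{E}(X_{n-1}) + (p - q) \mathbf{E}(W_n)$. Furthermore,
since $W_n \ge 0$, therefore $\mathbf{E}(W_n) \ge 0$. It follows that

(a) if $p = \frac12$, then $\mathbf{E}(X_n) = \mathbf{E}(X_{n-1}) = \ldots = \mathbf{E}(X_0) = a$, so that
$\lim \mathbf{E}(X_n) = a$;
(b) if $p < \frac12$, then $\mathbf{E}(X_n) \le \mathbf{E}(X_{n-1}) \le \ldots \le \mathbf{E}(X_0) = a$, so that
$\lim \mathbf{E}(X_n) \le a$;
(c) if $p > \frac12$, then $\mathbf{E}(X_n) \ge \mathbf{E}(X_{n-1}) \ge \ldots \ge \mathbf{E}(X_0) = a$, so that
$\lim \mathbf{E}(X_n) \ge a$.

This seems simple enough, and corresponds to our intuition: if $p \le \frac12$ then
the player’s expected value can only decrease. End of story?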
Perhaps not. Consider the “double ’til you win” policy, defined by $W_1 = 1$ and

$$W_n = \begin{cases} 2^{n-1}, & Z_1 = Z_2 = \ldots = Z_{n-1} = -1 \\ 0, & \text{otherwise} \end{cases} \qquad n \ge 2.$$



That is, we first bet $1. Each time we lose, we double our bet on the 
succeeding turn. As soon as we win once, we bet zero from then on. 

It is easily seen that, with this gambling policy, we will be up $1 as soon
as we win a bet. That is, letting $\tau = \inf\{n \ge 1; Z_n = +1\}$, we have that
$X_n = a + 1$ provided that $\tau \le n$. Now, clearly $\mathbf{P}(\tau > n) = (1-p)^n$. Hence,
assuming p > 0, we see that $\mathbf{P}(\tau < \infty) = 1$. It follows that $\mathbf{P}(\lim X_n =
a + 1) = 1$, so that $\mathbf{E}(\lim X_n) = a + 1$. In words, with probability one we
will gain $1 with this gambling policy, for any positive value of p, and thus
“cheat fate”. How can this be, in light of (a) and (b) above?

The answer, of course, is that in this case $\mathbf{E}(\lim X_n)$ (which equals a+1)
is not the same as $\lim \mathbf{E}(X_n)$ (which must be $\le a$). This is what allows us
to “cheat fate”. On the other hand, we may need to lose an arbitrarily large
amount of money before we win our $1, so “infinite capital” is required to
follow this gambling policy. We show now that, if the fortunes $X_n$ must
remain bounded (i.e., if we only have a finite amount of capital to draw on),
then $\mathbf{E}(\lim X_n)$ must indeed be the same as $\lim \mathbf{E}(X_n)$.
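
The gap between E(lim X_n) and lim E(X_n) under “double ’til you win” can be made explicit by computing E(X_n) exactly. The sketch below conditions on the time of the first win; `expected_fortune` is an illustrative name, and the closed form a + 1 - (2q)^n quoted in the comment is derived here rather than taken from the text.

```python
from fractions import Fraction

def expected_fortune(n, a, p):
    """Exact E(X_n) under 'double 'til you win', conditioning on the first winning bet."""
    q = 1 - p
    total = Fraction(0)
    for first_win in range(1, n + 1):
        # First win at this time: the fortune is a + 1 from then on.
        total += q ** (first_win - 1) * p * (a + 1)
    # No win in the first n bets: the losses are 1 + 2 + ... + 2^(n-1) = 2^n - 1.
    total += q ** n * (a - (2 ** n - 1))
    return total

a = 10
fair = [expected_fortune(n, a, Fraction(1, 2)) for n in range(1, 11)]
unfair = [expected_fortune(n, a, Fraction(2, 5)) for n in range(1, 11)]
# Summing the cases gives E(X_n) = a + 1 - (2q)^n.
```

With p = 1/2 the expectation stays at a for every n, and with p < 1/2 it decreases without bound, even though X_n converges to a + 1 with probability one.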


Theorem 7.3.1. (The bounded convergence theorem.) Let $\{X_n\}$ be a
sequence of random variables, with $\lim X_n = X$. Suppose there is $K \in \mathbf{R}$
such that $|X_n| \le K$ for all $n \in \mathbf{N}$ (i.e., the $\{X_n\}$ are uniformly bounded).
Then $\mathbf{E}(X) = \lim \mathbf{E}(X_n)$.
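

Proof. We have from the triangle inequality that

$$|\mathbf{E}(X) - \mathbf{E}(X_n)| = |\mathbf{E}(X - X_n)| \le \mathbf{E}(|X - X_n|).$$

We shall show that this last expression goes to 0 as $n \to \infty$. Indeed, fix
$\epsilon > 0$, and set $A_n = \{\omega \in \Omega; \ |X(\omega) - X_n(\omega)| \ge \epsilon\}$. Then $|X(\omega) -
X_n(\omega)| \le \epsilon + 2K \mathbf{1}_{A_n}(\omega)$, so that $\mathbf{E}(|X - X_n|) \le \epsilon + 2K \mathbf{P}(A_n)$. Hence,
using Proposition 3.4.1, we have that

$$\begin{aligned} \limsup \mathbf{E}(|X - X_n|) &\le \epsilon + 2K \limsup \mathbf{P}(A_n) \\ &\le \epsilon + 2K \, \mathbf{P}(\limsup A_n) \\ &= \epsilon, \end{aligned}$$

since $|X(\omega) - X_n(\omega)| \to 0$ for all $\omega \in \Omega$, so that $\limsup A_n$ is the empty
set. Since $\epsilon > 0$ was arbitrary, $\mathbf{E}(|X - X_n|) \to 0$, as claimed. ∎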


It follows immediately that, if we are gambling with $p \le \frac12$, and we use
any gambling policy which leaves the fortunes $\{X_n\}$ uniformly bounded,
then $\lim X_n$ (if it exists) will have expected value equal to $\lim \mathbf{E}(X_n)$, and
therefore be $\le a$. So, if we have only finite capital (no matter how large),
then it is not possible to cheat fate.


Finally, we consider the following question. Suppose $p < \frac12$, and $0 <
a < c$. What gambling system maximises $\mathbf{P}(\tau_c < \tau_0)$, i.e. maximises the
probability that we reach the fortune c before losing all our money?

For example, suppose again that p = 0.49, and that a = 9700, c = 10000.
If $W_n = 1$ (i.e. we bet exactly $1 each time), then we already know that
$\mathbf{P}(\tau_c < \tau_0) = [(0.51/0.49)^{9700} - 1] \, / \, [(0.51/0.49)^{10000} - 1] \doteq 6.1 \times 10^{-6}$.
On the other hand, if $W_n = 2$, then this is equivalent to instead having
a = 4850 and c = 5000, so we see that $\mathbf{P}(\tau_c < \tau_0) = [(0.51/0.49)^{4850} - 1] \, /$
$[(0.51/0.49)^{5000} - 1] \doteq 2.5 \times 10^{-3}$, which is about 400 times better. In fact,
if $W_n = 100$, then $\mathbf{P}(\tau_c < \tau_0) = [(0.51/0.49)^{97} - 1] \, / \, [(0.51/0.49)^{100} - 1] \doteq
0.885$, which is a very favourable probability.
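
These three probabilities can be reproduced in floating point. The sketch below assumes constant stakes that divide both a and c, so that the game rescales to unit bets; the function name and the dictionary `results` are illustrative.

```python
def success_probability(a, c, p):
    """P(tau_c < tau_0) for unit bets, written as in the example above (p != 1/2)."""
    ratio = (1 - p) / p
    return (ratio ** a - 1) / (ratio ** c - 1)

# Betting a constant stake w rescales the problem to a/w and c/w with unit bets.
results = {
    stake: success_probability(9700 // stake, 10000 // stake, 0.49)
    for stake in (1, 2, 100)
}
```

The largest power involved, (0.51/0.49)^10000, is of order 10^173 and so still fits in a double-precision float; much larger targets would need logarithms or exact arithmetic.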

This example suggests that, if $p < \frac12$ and we wish to maximise $\mathbf{P}(\tau_c <
\tau_0)$, then it is best to bet in larger amounts, i.e. to get the game over with as
quickly as possible. This is indeed true. More precisely, we define bold play
to be the gambling strategy $W_n = \min(X_{n-1}, c - X_{n-1})$, i.e. the strategy
of betting as much as possible each time. It is then a theorem that, when
$p < \frac12$, this is the optimal strategy in the sense of maximising $\mathbf{P}(\tau_c < \tau_0)$.
For a proof, see e.g. Billingsley (1995, Theorem 7.3).


Disclaimer. We note that for $p < \frac12$, even though to maximise $\mathbf{P}(\tau_c < \tau_0)$
it is best to bet large amounts, still overall it is best not to gamble at all.
Indeed, by (7.2.7), if $p \le \frac12$ and you have any finite amount of money, then
if you keep betting you will eventually lose it all and go broke. This is the
reason few probabilists attend gambling casinos!


7.4. Exercises. 


Exercise 7.4.1. For the stochastic process $\{X_n\}$ given by (7.0.1), compute
(for n, k > 0)

(a) $\mathbf{P}(X_n = k)$.

(b) $\mathbf{P}(\tau_k = n)$.

[Hint: These two questions do not have the same answer.]


Exercise 7.4.2. For the stochastic process $\{X_n\}$ given by (7.0.2), compute
(for n, k > 0)

(a) $\mathbf{P}(X_n = k)$.

(b) $\mathbf{P}(X_n > 0)$.


Exercise 7.4.3. Prove that there exist random variables Y and Z such
that $\mathbf{P}(Y = 1) = \mathbf{P}(Y = -1) = \mathbf{P}(Z = 1) = \mathbf{P}(Z = -1) = \frac14$, $\mathbf{P}(Y =
0) = \mathbf{P}(Z = 0) = \frac12$, and such that $\mathrm{Cov}(Y, Z) = \frac14$. (In particular, Y
and Z are not independent.) [Hint: First use Theorem 7.1.1 to construct
independent random variables $X_1$, $X_2$, and $X_3$ each having certain two-point
distributions. Then construct Y and Z as functions of $X_1$, $X_2$, and
$X_3$.]


Exercise 7.4.4. For the gambler’s ruin model of Subsection 7.2, with
c = 10000 and p = 0.49, find the smallest positive integer a such that
$s_{c,p}(a) \ge \frac12$. Interpret your result in plain English.


Exercise 7.4.5. For the gambler’s ruin model of Subsection 7.2, let
$\beta_n = \mathbf{P}(\min(\tau_0, \tau_c) > n)$ be the probability that the player’s fortune has
not hit 0 or c by time n.

(a) Find any explicit, simple expression $\gamma_n$ such that $\beta_n < \gamma_n$ for all $n \in \mathbf{N}$,
and such that $\lim_{n \to \infty} \gamma_n = 0$.

(b) Find any explicit, simple expression $\alpha_n$ such that $\beta_n > \alpha_n > 0$ for all
$n \in \mathbf{N}$.


Exercise 7.4.6. Let $\{W_n\}$ be i.i.d. with $\mathbf{P}[W_n = +1] = \mathbf{P}[W_n = 0] = 1/4$
and $\mathbf{P}[W_n = -1] = 1/2$, and let a be a positive integer. Let $X_n = a + W_1 +
\ldots + W_n$, and let $\tau_0 = \inf\{n \ge 0; X_n = 0\}$. Compute $\mathbf{P}(\tau_0 < \infty)$. [Hint:
Let $\{Y_n\}$ be like $\{X_n\}$, except with immediate repetitions of values omitted.
Is $\{Y_n\}$ a simple random walk? With what parameter p?]


Exercise 7.4.7. Verify explicitly that $r_{c,p}(a) + s_{c,p}(a) = 1$ for all a, c,
and p.


Exercise 7.4.8. In gambler’s ruin, recall that $\{\tau_c < \tau_0\}$ is the event
that the player eventually wins, and $\{\tau_0 < \tau_c\}$ is the event that the player
eventually loses.

(a) Give a similar plain-English description of the complement of the union
of these two events, i.e. $(\{\tau_c < \tau_0\} \cup \{\tau_0 < \tau_c\})^C$.

(b) Give three different proofs that the event described in part (a) has
probability 0: one using Exercise 7.4.7; a second using Exercise 7.4.5; and
a third recalling how the probabilities $s_{c,p}(a)$ were computed in the text,
and seeing to what extent the computation would have differed if we had
instead replaced $s_{c,p}(a)$ by $S_{c,p}(a) = \mathbf{P}(\tau_c \le \tau_0)$.

(c) Prove that, if $c \ge 4$, then the event described in part (a) contains uncountably
many outcomes (i.e., that uncountably many different sequences
$Z_1, Z_2, \ldots$ correspond to this event, even though it has probability 0). [Hint:
This is not entirely dissimilar from the analysis of the Cantor set in Subsection
2.4.]


Exercise 7.4.9. For the gambling policies model of Subsection 7.3,
consider the “triple ’til you win” policy defined by $W_1 = 1$, and for $n \ge 2$,
$W_n = 3^{n-1}$ if $Z_1 = \ldots = Z_{n-1} = -1$, otherwise $W_n = 0$.

(a) Prove that, with probability 1, the limit $\lim_{n \to \infty} X_n$ exists.

(b) Describe precisely the distribution of $\lim_{n \to \infty} X_n$.


Exercise 7.4.10. Consider the gambling policies model, with p = 1/3,
a = 6, and c = 8.

(a) Compute the probability $s_{c,p}(a)$ that the player will win (i.e. hit c
before hitting 0) if they bet $1 each time (i.e. if $W_n = 1$).

(b) Compute the probability that the player will win if they bet $2 each
time (i.e. if $W_n = 2$).

(c) Compute the probability that the player will win if they employ the
strategy of Bold Play (i.e., if $W_n = \min(X_{n-1}, c - X_{n-1})$). [Hint: While it
is difficult to do explicit computations involving Bold Play in general, here
the numbers are small enough that it is not difficult.]


7.5. Section summary. 


This section introduced the concept of stochastic processes, from the 
point of view of gambling games. It proved a general theorem about the 
existence of sequences of independent random variables having arbitrary 
specified distributions. It used this to define several models of gambling 
games. If the player bets $1 each time, and has independent probability p of
winning each bet, then explicit formulae were developed for the probability
that the player achieves some specified target fortune c before losing all
their money. The formula is very sensitive to values of p near $\frac12$.

If the player is allowed to choose how much to bet at each stage, then
with clever betting they can “cheat fate” and always win. However, this
requires them to have infinite capital available. If their capital is finite,
and $p \le \frac12$, then on average they will always lose money by the Bounded
Convergence Theorem.


8. Discrete Markov chains.


In this section, we consider the general notion of a (discrete time, discrete
space, time-homogeneous) Markov chain.

A Markov chain is characterised by three ingredients: a state space S
which is a finite or countable set; an initial distribution $\{\nu_i\}_{i \in S}$ consisting of
non-negative numbers summing to 1; and transition probabilities $\{p_{ij}\}_{i,j \in S}$
consisting of non-negative numbers with $\sum_{j \in S} p_{ij} = 1$ for each $i \in S$.

Intuitively, a Markov chain represents the random motion of some particle
moving around the space S. $\nu_i$ represents the probability that the
particle starts at the point i, while $p_{ij}$ represents the probability that, if
the particle is at the point i, it will then jump to the point j on the next
step.

More formally, a Markov chain is defined to be a sequence of random 
variables Xo, X 1, X2,... taking values in the set S, such that 


P(X = io, X1 = i1,- -, Xn = in) = Vig Digi: Piriz «+» Pin-tin 


for any n € N and any choice of io, 71,...,in E€ S. (Note that, for these 
{Xn} to be random variables in the sense of Definition 3.1.1, we need to 
have S C R; however, if we allow more general random variables as in 
Remark 3.1.10, then this restriction is not necessary.) It then follows that, 
for example, 


P(X = 5) = X P(X% =i, M1 =j) = X vip. 
ies ies 


This also has a matrix interpretation: writing $[p]$ for the matrix $\{p_{ij}\}_{i,j \in S}$,
and $[\mu^{(n)}]$ for the row-vector $\{P(X_n = i)\}_{i \in S}$, we have $[\mu^{(1)}] = [\mu^{(0)}] [p]$,
and more generally $[\mu^{(n+1)}] = [\mu^{(n)}] [p]$.
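The matrix interpretation can be illustrated with a short sketch. The two-state chain below is a hypothetical example chosen only for illustration (it is not one of the examples in this section), and the helper name `step` is likewise an assumption.

```python
from fractions import Fraction

def step(mu, p):
    """One step of the chain: return the row vector mu * p."""
    n = len(p)
    return [sum(mu[i] * p[i][j] for i in range(n)) for j in range(n)]

# Hypothetical two-state chain, used only for illustration.
half = Fraction(1, 2)
p = [[half, half],
     [Fraction(1), Fraction(0)]]
mu0 = [Fraction(1), Fraction(0)]   # start in the first state with probability 1

mu1 = step(mu0, p)   # distribution of X_1
mu2 = step(mu1, p)   # distribution of X_2
```

Exact fractions are used so that each row vector can be compared with a hand computation.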

We present some simple examples of Markov chains here. Note that,
except in the first example, we do not bother to specify initial probabilities
$\{\nu_i\}$; we shall see that initial probabilities are often not crucial when
studying a chain’s properties.


Example 8.0.1. (Simple random walk.) Let $S = \mathbf{Z}$ be the set of all
integers. Fix $a \in \mathbf{Z}$, and let $\nu_a = 1$, with $\nu_i = 0$ for $i \neq a$. Fix a real
number $p$ with $0 < p < 1$, and let $p_{i,i+1} = p$ and $p_{i,i-1} = 1 - p$ for each
$i \in \mathbf{Z}$, with $p_{ij} = 0$ if $j \neq i \pm 1$. Thus, this Markov chain begins at the
point $a$ (with probability 1), and at each step either increases by 1 (with
probability $p$) or decreases by 1 (with probability $1 - p$). It is easily seen that
this Markov chain corresponds precisely to the gambling game of Subsection
7.2.


Example 8.0.2. Let $S = \{1, 2, 3\}$ consist of just three elements, and
define the transition probabilities $\{p_{ij}\}$ in matrix form by


0 1/2 1/2 
1/4 1/4 1/2 


(so that $p_{31} = 1/4$, etc.). This Markov chain jumps around on the three
points $\{1, 2, 3\}$ in a random and interesting way.


Example 8.0.3. (Random walk on $\mathbf{Z}/(d)$.) Let $S = \{0, 1, 2, \ldots, d-1\}$,
and define the transition probabilities by

$$p_{ij} = \begin{cases} 1/3, & i = j \text{ or } i \equiv j + 1 \ (\mathrm{mod}\ d) \text{ or } i \equiv j - 1 \ (\mathrm{mod}\ d); \\ 0, & \text{otherwise}. \end{cases}$$

If we think of the $d$ elements of $S$ as arranged in a circle, then our particle,
at each step, either stays where it is, or moves one step clockwise, or moves
one step counter-clockwise, each with probability 1/3.


Example 8.0.4. (Ehrenfest’s urn.) Consider two urns, Urn 1 and Urn 2.
Suppose there are $d$ balls divided between the two urns. Suppose at each
step, we choose one ball uniformly at random from among the $d$ balls, and
switch it to the opposite urn. We let $X_n$ be the number of balls in Urn 1 at
time $n$. Thus, $S = \{0, 1, 2, \ldots, d\}$, with $p_{i,i-1} = i/d$ and $p_{i,i+1} = (d - i)/d$,
for $0 \le i \le d$ (with $p_{ij} = 0$ if $j \neq i \pm 1$). Thus, this Markov chain moves
randomly among the possible numbers $\{0, 1, \ldots, d\}$ of balls in Urn 1 at each
time. One might expect that, if $d$ is large and the Markov chain is run for a
long time, then there would most likely be approximately $d/2$ balls in Urn
1. (We shall consider such questions in Subsection 8.3 below.)
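As a small illustration, the transition probabilities of Ehrenfest's urn can be tabulated directly from the formulas in Example 8.0.4. This is only a sketch: the function name `ehrenfest` and the choice $d = 4$ are assumptions made for the example.

```python
from fractions import Fraction

def ehrenfest(d):
    """Transition matrix of Ehrenfest's urn with d balls (states 0, 1, ..., d)."""
    p = [[Fraction(0)] * (d + 1) for _ in range(d + 1)]
    for i in range(d + 1):
        if i > 0:
            p[i][i - 1] = Fraction(i, d)        # a ball leaves Urn 1
        if i < d:
            p[i][i + 1] = Fraction(d - i, d)    # a ball enters Urn 1
    return p

P = ehrenfest(4)
row_sums = [sum(row) for row in P]   # each row should sum to 1
```

Checking that every row sums to 1 confirms that the two formulas together define valid transition probabilities, including at the boundary states 0 and $d$.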


We note that we can also interpret Markov chains in terms of conditional
probability. Recall that, if $A$ and $B$ are events with $P(B) > 0$, then the
conditional probability of $A$ given $B$ is $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$; intuitively, it is
the probabilistic proportion of the event $B$ which is also contained in the
event $A$. Thus, for a Markov chain, if $P(X_k = i) > 0$, then

$$\begin{aligned}
P(X_{k+1} = j \mid X_k = i) &= \frac{P(X_k = i, \ X_{k+1} = j)}{P(X_k = i)} \\
&= \frac{\sum_{i_0, i_1, \ldots, i_{k-1}} \nu_{i_0} \, p_{i_0 i_1} \, p_{i_1 i_2} \cdots p_{i_{k-2} i_{k-1}} \, p_{i_{k-1} i} \, p_{ij}}{\sum_{i_0, i_1, \ldots, i_{k-1}} \nu_{i_0} \, p_{i_0 i_1} \, p_{i_1 i_2} \cdots p_{i_{k-2} i_{k-1}} \, p_{i_{k-1} i}} \\
&= p_{ij}.
\end{aligned}$$

This formally justifies the notion of $p_{ij}$ as a transition probability. Note
also that this conditional probability does not depend on the starting time
$k$, justifying the notion of time homogeneity. (The only reason we do not
take this conditional probability as our definition of a Markov chain, is that
the conditional probability is not well-defined if $P(X_k = i) = 0$.)

We can similarly compute, for any $n \in \mathbf{N}$, again assuming $P(X_k = i) > 0$,
that $P(X_{k+n} = j \mid X_k = i)$ is equal to

$$\sum_{i_{k+1}, i_{k+2}, \ldots, i_{k+n-1}} p_{i \, i_{k+1}} \, p_{i_{k+1} i_{k+2}} \cdots p_{i_{k+n-2} i_{k+n-1}} \, p_{i_{k+n-1} j} = p_{ij}^{(n)};$$

here $p_{ij}^{(n)}$ is an $n$-th order transition probability. Note again that this
probability does not depend on the starting time $k$ (despite appearances to the
contrary). By convention, we set

$$p_{ij}^{(0)} = \delta_{ij} = \begin{cases} 1, & i = j \\ 0, & \text{otherwise}, \end{cases} \qquad (8.0.5)$$

i.e. in zero time units the Markov chain just stays where it is.
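The $n$-th order transition probabilities are the entries of the $n$-th power of the matrix $[p]$, with the convention (8.0.5) corresponding to the identity matrix. The sketch below illustrates this for the random walk on $\mathbf{Z}/(d)$ of Example 8.0.3 with the assumed value $d = 4$; the helper names `mat_mul` and `n_step` are hypothetical.

```python
from fractions import Fraction

def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def n_step(p, n):
    """Matrix of n-step transition probabilities p_ij^(n), for n >= 0."""
    size = len(p)
    # n = 0 gives the identity matrix, matching the convention (8.0.5).
    result = [[Fraction(int(i == j)) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, p)
    return result

# Random walk on Z/(d) as in Example 8.0.3, with d = 4 chosen for illustration.
d = 4
p = [[Fraction(1, 3) if (j - i) % d in (0, 1, d - 1) else Fraction(0)
      for j in range(d)] for i in range(d)]

two_step_from_0 = n_step(p, 2)[0]   # the row (p_00^(2), ..., p_03^(2))
```

Repeated multiplication is the simplest choice here; for large $n$ one would more likely use repeated squaring.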


8.1. A Markov chain existence theorem. 


To rigorously study Markov chains, we need to be sure that they exist. 
Fortunately, this is relatively straightforward. 


Theorem 8.1.1. Given a non-empty countable set $S$, and non-negative
numbers $\{\nu_i\}_{i \in S}$ and $\{p_{ij}\}_{i,j \in S}$, with $\sum_j \nu_j = 1$ and $\sum_j p_{ij} = 1$ for each
$i \in S$, there exists a probability triple $(\Omega, \mathcal{F}, P)$, and random variables
$X_0, X_1, \ldots$ defined on $(\Omega, \mathcal{F}, P)$, such that

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = \nu_{i_0} \, p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$$

for all $n \in \mathbf{N}$ and all $i_0, \ldots, i_n \in S$.


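Proof. We let $(\Omega, \mathcal{F}, P)$ be Lebesgue measure on $[0,1]$. We construct the
random variables $\{X_n\}$ as follows.

1. Partition $[0,1]$ into intervals $\{I_i^{(0)}\}_{i \in S}$, with $\mathrm{length}(I_i^{(0)}) = \nu_i$.

2. Partition each $I_i^{(0)}$ into intervals $\{I_{ij}^{(1)}\}_{j \in S}$, with $\mathrm{length}(I_{ij}^{(1)}) = \nu_i \, p_{ij}$.

3. Inductively partition $[0,1]$ into intervals $\{I_{i_0 i_1 \ldots i_n}^{(n)}\}_{i_0, i_1, \ldots, i_n \in S}$, such that
$I_{i_0 i_1 \ldots i_n}^{(n)} \subseteq I_{i_0 i_1 \ldots i_{n-1}}^{(n-1)}$ and $\mathrm{length}(I_{i_0 i_1 \ldots i_n}^{(n)}) = \nu_{i_0} \, p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$, for all
$n \in \mathbf{N}$.

4. Define $X_n$ by saying that $X_n(\omega) = i_n$ if $\omega \in I_{i_0 i_1 \ldots i_{n-1} i_n}^{(n)}$ for some choice
of $i_0, \ldots, i_{n-1} \in S$.

Then it is easily verified that the random variables $\{X_n\}$ have the desired
properties. ∎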


We thus see that, as in Theorem 7.1.1, an “old friend” (in this case 
Lebesgue measure on [0,1]) was able to serve as a probability space on 
which to define important random variables. 
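The nested-interval construction in the proof of Theorem 8.1.1 can be sketched in code for a finite state space. The sketch makes some assumptions not fixed by the proof: states are labelled $0, 1, \ldots$, sub-intervals are laid out left to right in that order and taken half-open, and the function name `path_from_omega` and the two-state chain are hypothetical.

```python
from fractions import Fraction

def path_from_omega(omega, nu, p, n):
    """Return [X_0(omega), ..., X_n(omega)] via the nested-interval construction.

    States are 0, 1, ..., len(nu) - 1; intervals are taken half-open, [a, b).
    """
    lo, length = Fraction(0), Fraction(1)   # current interval [lo, lo + length)
    weights = nu                            # proportions used to split it
    path = []
    for _ in range(n + 1):
        left = lo
        for state, w in enumerate(weights):
            if w > 0 and left <= omega < left + length * w:
                path.append(state)
                lo, length = left, length * w   # descend into the sub-interval
                break
            left += length * w
        else:
            raise ValueError("omega must lie in [0, 1)")
        weights = p[path[-1]]               # next split uses the row of the new state
    return path

# Hypothetical two-state chain, used only for illustration.
half = Fraction(1, 2)
nu = [half, half]
p = [[half, half],
     [Fraction(1), Fraction(0)]]

path_a = path_from_omega(Fraction(3, 8), nu, p, 2)
path_b = path_from_omega(Fraction(7, 8), nu, p, 2)
```

Each sample point $\omega$ determines the whole path at once, and the length of the interval reached after step $n$ is exactly $\nu_{i_0} p_{i_0 i_1} \cdots p_{i_{n-1} i_n}$.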


8.2. Transience, recurrence, and irreducibility. 


In this subsection, we consider some fundamental notions related to
Markov chains. For simplicity, we shall assume that $\nu_i > 0$ for all $i \in S$, and
shall write $P_i(\cdots)$ as shorthand for the conditional probability $P(\cdots \mid X_0 = i)$.
Intuitively, $P_i(A)$ stands for the probability that the event $A$ would have
occurred, had the Markov chain been started at the state $i$. We shall also
write $E_i(\cdots)$ for expected values with respect to $P_i$.

To proceed, we define, for $i, j \in S$ and $n \in \mathbf{N}$, the probabilities

$$f_{ij}^{(n)} = P_i(X_n = j, \text{ but } X_m \neq j \text{ for } 1 \le m \le n-1);$$

$$f_{ij} = P_i(\exists\, n \ge 1;\ X_n = j) = \sum_{n=1}^{\infty} f_{ij}^{(n)}.$$

That is, $f_{ij}^{(n)}$ is the probability, starting from $i$, that we first hit $j$ at the
time $n$; $f_{ij}$ is the probability, starting from $i$, that we ever hit $j$.

A state $i \in S$ is called recurrent (or persistent) if $f_{ii} = 1$, i.e. if starting
from $i$ we will certainly eventually return to $i$. It is called transient if it is
not recurrent, i.e. if $f_{ii} < 1$. Recurrence and transience are very important
concepts in Markov chain theory, and we prove some results about them
here. (Recall that i.o. stands for infinitely often.)


Theorem 8.2.1. Let $\{X_n\}$ be a Markov chain, on a state space $S$.
Let $i \in S$. Then $i$ is transient if and only if $P_i(X_n = i \text{ i.o.}) = 0$ if and
only if $\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$. On the other hand, $i$ is recurrent if and only if
$P_i(X_n = i \text{ i.o.}) = 1$ if and only if $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$.

To prove this theorem, we begin with a lemma.

Lemma 8.2.2. We have $P_i(\#\{n \ge 1;\ X_n = j\} \ge k) = f_{ij} (f_{jj})^{k-1}$, for
$k = 1, 2, \ldots$. In words, the probability that, starting from $i$, we eventually
hit the state $j$ at least $k$ times, is given by $f_{ij} (f_{jj})^{k-1}$.


Proof. Starting from $i$, to hit $j$ at least $k$ times is equivalent to first
hitting $j$ once (starting from $i$), then to return to $j$ at least $k - 1$ more
times (each time starting from $j$). The result follows. ∎


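Proof of Theorem 8.2.1. By continuity of probabilities, and by Lemma
8.2.2, we have that

$$P_i(X_n = i \text{ i.o.}) = \lim_{k \to \infty} P_i(\#\{n \ge 1;\ X_n = i\} \ge k) = \lim_{k \to \infty} (f_{ii})^k = \begin{cases} 0, & f_{ii} < 1 \\ 1, & f_{ii} = 1. \end{cases}$$

This proves the first equivalence to each of transience and recurrence.

Also, using first countable linearity, and then Proposition 4.2.9, and then
Lemma 8.2.2, we compute that

$$\begin{aligned}
\sum_{n=1}^{\infty} p_{ii}^{(n)} &= \sum_{n=1}^{\infty} P_i(X_n = i) \\
&= E_i\Big(\sum_{n=1}^{\infty} \mathbf{1}_{X_n = i}\Big) = E_i(\#\{n \ge 1;\ X_n = i\}) \\
&= \sum_{k=1}^{\infty} P_i(\#\{n \ge 1;\ X_n = i\} \ge k) \\
&= \sum_{k=1}^{\infty} (f_{ii})^k = \begin{cases} \frac{f_{ii}}{1 - f_{ii}} < \infty, & f_{ii} < 1 \\ \infty, & f_{ii} = 1, \end{cases}
\end{aligned}$$

thus proving the remaining two equivalences. ∎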


As an application of this theorem, we consider simple symmetric random
walk (cf. Subsection 7.2 or Example 8.0.1, with $X_0 = 0$ and $p = 1/2$).
We recall that here $S = \mathbf{Z}$, and for any $i \in \mathbf{Z}$ we have
$p_{ii}^{(n)} = \binom{n}{n/2} (1/2)^n = \frac{n!}{[(n/2)!]^2 \, 2^n}$ for $n$ even (with $p_{ii}^{(n)} = 0$ for $n$ odd). Using Stirling’s
approximation, which states that $n! \sim (n/e)^n \sqrt{2 \pi n}$ as $n \to \infty$, we compute that
for large even $n$, we have $p_{ii}^{(n)} \sim \sqrt{2/(\pi n)}$. Hence, we see (cf. (A.3.7)) that
$\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$. Therefore, by Theorem 8.2.1, simple symmetric random
walk is recurrent.

On the other hand, if $p \neq 1/2$ for simple random walk, then $p_{ii}^{(n)} =
\binom{n}{n/2} (p(1-p))^{n/2}$ with $p(1-p) < 1/4$. Stirling then gives for large even
$n$ that $p_{ii}^{(n)} \sim [4p(1-p)]^{n/2} \sqrt{2/(\pi n)}$, with $4p(1-p) < 1$. It follows that
$\sum_{n=1}^{\infty} p_{ii}^{(n)} < \infty$, so that simple asymmetric random walk is not recurrent.


In higher dimensions, suppose that $S = \mathbf{Z}^d$, with each of the $d$ coordinates
independently following simple symmetric random walk. Then clearly
$p_{ii}^{(n)}$ for this process is simply the $d$-th power of the corresponding quantity
for ordinary (one-dimensional) simple symmetric random walk. Hence, for
large even $n$, we have $p_{ii}^{(n)} \sim (2/(\pi n))^{d/2}$ in this case. It follows (cf. (A.3.7))
that $\sum_{n=1}^{\infty} p_{ii}^{(n)} = \infty$ if and only if $d \le 2$. That is, higher-dimensional simple
symmetric random walk is recurrent in dimensions 1 and 2, but transient
in dimensions $\ge 3$, a somewhat counter-intuitive result.
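The dichotomy between $d \le 2$ and $d \ge 3$ can be seen numerically by summing $p_{ii}^{(n)}$ over even $n$ up to a cutoff. The sketch below is only an illustration of the growth rates, not a proof; the function name `return_prob_sum` and the cutoff 2000 are assumptions.

```python
def return_prob_sum(d, max_steps):
    """Sum of p_ii^(n) over even n <= max_steps, for d independent coordinates."""
    total = 0.0
    q = 1.0                   # q = C(n, n/2) / 2^n, starting from n = 0
    for n in range(2, max_steps + 1, 2):
        q *= (n - 1) / n      # C(n, n/2)/2^n = C(n-2, (n-2)/2)/2^(n-2) * (n-1)/n
        total += q ** d       # d-th power, since the coordinates are independent
    return total

s1 = return_prob_sum(1, 2000)   # keeps growing like sqrt(max_steps)
s2 = return_prob_sum(2, 2000)   # keeps growing like log(max_steps)
s3 = return_prob_sum(3, 2000)   # stays bounded as max_steps increases
```

Increasing the cutoff makes `s1` and `s2` grow without bound (the latter very slowly), while `s3` settles to a finite limit, in line with Theorem 8.2.1.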


Finally, we note that for irreducible Markov chains, somewhat more
can be said. A Markov chain is irreducible if $f_{ij} > 0$ for all $i, j \in S$, i.e.
if it is possible for the chain to move from any state to any other state.
Equivalently, the Markov chain is irreducible if for any $i, j \in S$, there is
$r \in \mathbf{N}$ with $p_{ij}^{(r)} > 0$. (A chain is reducible if it is not irreducible, i.e. if
$f_{ij} = 0$ for some $i, j \in S$.) We can now prove


Theorem 8.2.3. Let $\{p_{ij}\}_{i,j \in S}$ be the transition probabilities for an
irreducible Markov chain on a state space $S$. Then the following are
equivalent:

(1) There is $k \in S$ with $f_{kk} = 1$, i.e. with $k$ recurrent.

(2) For all $i, j \in S$, we have $f_{ij} = 1$ (so, in particular, all states are
recurrent).

(3) There are $k, \ell \in S$ with $\sum_{n=1}^{\infty} p_{k\ell}^{(n)} = \infty$.

(4) For all $i, j \in S$, we have $\sum_{n=1}^{\infty} p_{ij}^{(n)} = \infty$.

If any of (1)-(4) hold, we say that the Markov chain itself is recurrent.


Proof. That (2) implies (1) is immediate.

That (1) implies (3) follows immediately from Theorem 8.2.1.

To show that (3) implies (4): Assume that (3) holds, and let $i, j \in S$.
By irreducibility, there are $m, r \in \mathbf{N}$ with $p_{ik}^{(m)} > 0$ and $p_{\ell j}^{(r)} > 0$. Then,
since $p_{ij}^{(m+n+r)} \ge p_{ik}^{(m)} \, p_{k\ell}^{(n)} \, p_{\ell j}^{(r)}$, we have $\sum_n p_{ij}^{(n)} \ge p_{ik}^{(m)} \, p_{\ell j}^{(r)} \sum_n p_{k\ell}^{(n)} = \infty$,
as claimed.

To show that (4) implies (2): Suppose, to the contrary of (2), that
$f_{ij} < 1$ for some $i, j \in S$. Then $1 - f_{jj} \ge P_j(\tau_i < \tau_j)(1 - f_{ij}) > 0$. (Here
$P_j(\tau_i < \tau_j)$ stands for the probability, starting from $j$, that we hit $i$ before
returning to $j$; and it is positive by irreducibility.) Hence, $f_{jj} < 1$, so by
Theorem 8.2.1 we have $\sum_n p_{jj}^{(n)} < \infty$, contradicting (4). ∎


For example, for simple symmetric random walk, since $\sum_n p_{ii}^{(n)} = \infty$,
this theorem says that from any state $i$, the walk will (with probability 1)
eventually reach any other state $j$. Thus, the walk will keep on wandering
from state to state forever; this is related to “fluctuation theory” for random
walks. In particular, this implies that for simple symmetric random walk,
with probability 1 we have $\limsup_n X_n = \infty$ and $\liminf_n X_n = -\infty$.

Such considerations are related to the remarkable law of the iterated
logarithm. Let $X_n = Z_1 + \ldots + Z_n$ define a random walk, where $Z_1, Z_2, \ldots$
are i.i.d. with mean 0 and variance 1. (For example, perhaps $\{X_n\}$ is simple
symmetric random walk.) Then it is a fact that with probability 1,
$\limsup_n (X_n / \sqrt{2 n \log\log n}) = 1$. In other words, for any $\epsilon > 0$, we have
$P(X_n \ge (1+\epsilon) \sqrt{2 n \log\log n} \text{ i.o.}) = 0$ and $P(X_n \ge (1-\epsilon) \sqrt{2 n \log\log n} \text{ i.o.}) = 1$.
This gives extremely precise information about how the peaks of $\{X_n\}$
grow for large $n$. For a proof see e.g. Billingsley (1995, Theorem 9.5).


8.3. Stationary distributions and convergence. 


Given a Markov chain on a state space $S$, with transition probabilities
$\{p_{ij}\}$, let $\{\pi_i\}_{i \in S}$ be a distribution on $S$ (i.e., $\pi_i \ge 0$ for all $i \in S$,
and $\sum_{i \in S} \pi_i = 1$). Then $\{\pi_i\}_{i \in S}$ is stationary for the Markov chain if
$\sum_{i \in S} \pi_i \, p_{ij} = \pi_j$ for all $j \in S$. (In matrix form, $[\pi][p] = [\pi]$.)

What this means is that if we start the Markov chain in the distribution
$\{\pi_i\}$ (i.e., $P[X_0 = i] = \pi_i$ for all $i \in S$), then one time unit later the
distribution will still be $\{\pi_i\}$ (i.e., $P[X_1 = i] = \pi_i$ for all $i \in S$); this is the
reason for the terminology “stationary”. It follows immediately by induction
that the chain will still be in the same distribution $\{\pi_i\}$ any number $n$ steps
later (i.e., $P[X_n = i] = \pi_i$ for all $i \in S$). Equivalently, $\sum_{i \in S} \pi_i \, p_{ij}^{(n)} = \pi_j$ for
any $n \in \mathbf{N}$. (In matrix form this is even clearer: $[\pi][p]^n = [\pi]$.)


Exercise 8.3.1. A Markov chain is said to be reversible with respect to
a distribution $\{\pi_i\}$ if, for all $i, j \in S$, we have $\pi_i \, p_{ij} = \pi_j \, p_{ji}$. Prove that, if
a Markov chain is reversible with respect to $\{\pi_i\}$, then $\{\pi_i\}$ is a stationary
distribution for the chain.


For example, for random walk on $\mathbf{Z}/(d)$ as in Example 8.0.3, it is computed
that the uniform distribution, given by $\pi_i = 1/d$ for $i = 0, 1, \ldots, d-1$,
is a stationary distribution. For Ehrenfest’s Urn (Example 8.0.4), it is computed
that the binomial distribution, given by $\pi_i = \binom{d}{i} 2^{-d}$ for $i = 0, 1, \ldots, d$,
is a stationary distribution.


Exercise 8.3.2. Verify these last two statements. 
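A numerical spot check of the two stationarity claims (for one value of $d$ only, so not a substitute for the proof the exercise asks for) might look as follows. The value $d = 6$ and the helper name `is_stationary` are assumptions made for the illustration.

```python
from fractions import Fraction
from math import comb

def is_stationary(pi, p):
    """Check sum_i pi_i p_ij == pi_j for every j, using exact arithmetic."""
    n = len(p)
    return all(sum(pi[i] * p[i][j] for i in range(n)) == pi[j] for j in range(n))

d = 6

# Random walk on Z/(d) (Example 8.0.3) with the uniform distribution.
circle = [[Fraction(1, 3) if (j - i) % d in (0, 1, d - 1) else Fraction(0)
           for j in range(d)] for i in range(d)]
uniform = [Fraction(1, d)] * d

# Ehrenfest's urn (Example 8.0.4) with the binomial distribution.
urn = [[Fraction(0)] * (d + 1) for _ in range(d + 1)]
for i in range(d + 1):
    if i > 0:
        urn[i][i - 1] = Fraction(i, d)
    if i < d:
        urn[i][i + 1] = Fraction(d - i, d)
binomial = [Fraction(comb(d, i), 2 ** d) for i in range(d + 1)]

circle_ok = is_stationary(uniform, circle)
urn_ok = is_stationary(binomial, urn)
```

Exact fractions avoid any rounding question in the equality test. By contrast, the uniform distribution on $\{0, \ldots, d\}$ fails the same check for the urn.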


Now, given a Markov chain with a stationary distribution, one might
expect that if the chain is run for a long time (i.e. $n \to \infty$), then the
probability of being at a particular state $j \in S$ might converge to $\pi_j$, regardless
of the initial state chosen. That is, one might expect that
$\lim_{n \to \infty} P_i(X_n = j) = \pi_j$ for any $i, j \in S$. This is not true in complete generality, as the
following two examples show. However, we shall see in Theorem 8.3.10 below
that this is indeed true for many Markov chains.

For a first example, suppose $S = \{1, 2\}$, and that the transition probabilities
are given by

$$(p_{ij}) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}.$$

That is, this Markov chain never moves at all! Hence, any distribution is
stationary for this chain. In particular, we could take $\pi_1 = \pi_2 = 1/2$ as a
stationary distribution. On the other hand, we clearly have $P_1(X_n = 1) = 1$
for all $n \in \mathbf{N}$, which certainly does not approach 1/2. Thus, this Markov chain
does not converge to the stationary distribution $\{\pi_i\}$. In fact, this Markov
chain is clearly reducible (i.e., not irreducible), which is the obstacle to
convergence.

For a second example, suppose again that $S = \{1, 2\}$, and that the
transition probabilities are given this time by

$$(p_{ij}) = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

Again we may take $\pi_1 = \pi_2 = 1/2$ as a stationary distribution (in fact,
this time the stationary distribution is unique). Furthermore, this Markov
chain is irreducible. On the other hand, we have $P_1(X_n = 1) = 1$ for $n$ even,
and $P_1(X_n = 1) = 0$ for $n$ odd. Hence, again we do not have
$P_1(X_n = 1) \to 1/2$. (On the other hand, the Cesàro averages of $P_1(X_n = 1)$, i.e.
$\frac{1}{n} \sum_{i=1}^{n} P_1(X_i = 1)$, do indeed converge to 1/2, which is not a coincidence.)
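The oscillation and the convergence of the Cesàro averages in this second example can be checked directly. In the sketch below, state 1 is stored at index 0, and the helper name `return_probs` is an assumption.

```python
from fractions import Fraction

# The two-state chain that always switches state.
p = [[Fraction(0), Fraction(1)],
     [Fraction(1), Fraction(0)]]

def return_probs(p, n_max):
    """Return [P_1(X_n = 1) for n = 1, ..., n_max]; state 1 is index 0."""
    row = [Fraction(1), Fraction(0)]      # distribution of X_0 under P_1
    out = []
    for _ in range(n_max):
        row = [sum(row[i] * p[i][j] for i in range(2)) for j in range(2)]
        out.append(row[0])
    return out

probs = return_probs(p, 11)
# Cesaro averages (1/n) * sum_{i=1}^{n} P_1(X_i = 1), for n = 1, ..., 11.
cesaro = [sum(probs[:n]) / n for n in range(1, 12)]
```

The individual probabilities alternate between 0 and 1, while the averages stay within $1/(2n)$ of 1/2.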


Here the obstacle to convergence is that the Markov chain is “periodic”,
with period 2, as we now discuss.


Definition 8.3.3. Given Markov chain transitions $\{p_{ij}\}$ on a state space
$S$, and a state $i \in S$, the period of $i$ is the greatest common divisor of the
set $\{n \ge 1;\ p_{ii}^{(n)} > 0\}$.


That is, the period of $i$ is the g.c.d. of the times at which it is possible
to travel from $i$ to $i$. For example, if the period of $i$ is 2, then this means it
is only possible to travel from $i$ to $i$ in an even number of steps. (Such was
the case for the second example above.) Clearly, if the period of a state is
greater than 1, then this will be an obstacle to convergence to a stationary
distribution. This prompts the following definition.


Definition 8.3.4. A Markov chain is aperiodic if the period of each state 
is 1. (A chain which is not aperiodic is said to be periodic.) 
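For a finite chain, the period of Definition 8.3.3 can be estimated by taking the g.c.d. of the possible return times up to some cutoff. The sketch below is an approximation under that assumption: the cutoff `n_max` and the function name `period` are hypothetical, and the result is only guaranteed to be a multiple of the true period unless `n_max` is large enough.

```python
from math import gcd

def period(p, i, n_max):
    """gcd of {1 <= n <= n_max : p_ii^(n) > 0}; returns 0 if no return is seen."""
    size = len(p)
    reach = [k == i for k in range(size)]   # states reachable in exactly 0 steps
    g = 0
    for n in range(1, n_max + 1):
        # States reachable in exactly n steps from i.
        reach = [any(reach[k] and p[k][j] > 0 for k in range(size))
                 for j in range(size)]
        if reach[i]:
            g = gcd(g, n)
    return g

flip = [[0, 1], [1, 0]]                      # second example above
cycle = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]    # deterministic 3-cycle
lazy = [[0.5, 0.5], [0.5, 0.5]]              # can stay put, so aperiodic
```

Only the pattern of positive entries matters here, so the sketch tracks reachability rather than the probabilities themselves.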


Before proceeding, we note a fact about periods that makes aperiodicity
easier to verify.

Lemma 8.3.5. Let $i$ and $j$ be two states of a Markov chain, and suppose
that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., $i$ and $j$ “communicate”). Then the periods
of $i$ and of $j$ are equal.


Proof. Since $f_{ij} > 0$ and $f_{ji} > 0$, there are $r, s \in \mathbf{N}$ with $p_{ij}^{(r)} > 0$ and
$p_{ji}^{(s)} > 0$. Since $p_{ii}^{(r+n+s)} \ge p_{ij}^{(r)} \, p_{jj}^{(n)} \, p_{ji}^{(s)}$, this implies that

$$p_{ii}^{(r+n+s)} > 0 \ \text{ whenever } \ p_{jj}^{(n)} > 0. \qquad (8.3.6)$$

Now, if we let the periods of $i$ and $j$ be $t_i$ and $t_j$, respectively, then (8.3.6)
with $n = 0$ implies that $t_i$ divides $r + s$. Then, for any $n \in \mathbf{N}$ with $p_{jj}^{(n)} > 0$,
(8.3.6) implies that $t_i$ divides $r + n + s$, hence that $t_i$ divides $n$. That is, $t_i$
is a common divisor of $\{n \in \mathbf{N};\ p_{jj}^{(n)} > 0\}$. Since $t_j$ is the greatest common
divisor of this set, we must have $t_i \le t_j$. Similarly, $t_j \le t_i$, so we must have
$t_i = t_j$. ∎

This immediately implies


Corollary 8.3.7. If a Markov chain is irreducible, then all of its states
have the same period.


Hence, for an irreducible Markov chain, it suffices to check aperiodicity 
at any single state. 

We shall prove in Theorem 8.3.10 below that all Markov chains which 
are irreducible and aperiodic, and have a stationary distribution, do in fact 
converge to it. Before doing so, we require two further lemmas, which give 
more concrete implications of irreducibility and aperiodicity. 


Lemma 8.3.8. If a Markov chain is irreducible, and has a stationary
distribution $\{\pi_i\}$, then it is recurrent.


Proof. Suppose to the contrary that the chain is not recurrent. Then, by
Theorem 8.2.3, we have $\sum_n p_{ij}^{(n)} < \infty$ for all states $i$ and $j$; in particular,
$\lim_{n \to \infty} p_{ij}^{(n)} = 0$. Now, since $\{\pi_i\}$ is stationary, we have $\pi_j = \sum_i \pi_i \, p_{ij}^{(n)}$ for
all $n \in \mathbf{N}$. Hence, letting $n \to \infty$, we see (formally, by using the M-test,
cf. Proposition A.4.8) that we must have $\pi_j = 0$, for all states $j$. This
contradicts the fact that $\sum_j \pi_j = 1$. ∎


Remark. The converse to Lemma 8.3.8 is false, in the sense that just
because an irreducible Markov chain is recurrent, it might not have a
stationary distribution (e.g. this is the case for simple symmetric random walk).


Lemma 8.3.9. If a Markov chain is irreducible and aperiodic, then for
each pair $(i, j)$ of states, there is a number $n_0 = n_0(i, j)$, such that $p_{ij}^{(n)} > 0$
for all $n \ge n_0$.


Proof. Fix states $i$ and $j$. Let $T = \{n \ge 1;\ p_{ii}^{(n)} > 0\}$. Then, by
aperiodicity, $\gcd(T) = 1$. Hence, we can find $m \in \mathbf{N}$, and $k_1, \ldots, k_m \in T$,
and integers $b_1, \ldots, b_m$, such that $k_1 b_1 + \ldots + k_m b_m = 1$. We furthermore
choose any $a \in T$, and also (by irreducibility) choose $c \in \mathbf{N}$ with $p_{ij}^{(c)} > 0$.
These values shall be the key to defining $n_0$.

We now set $M = k_1 |b_1| + \ldots + k_m |b_m|$ (i.e., a sum that without the
absolute value signs would be 1), and define $n_0 = n_0(i, j) = aM + c$. Then,
if $n \ge n_0$, then letting $r = \lfloor (n - c)/a \rfloor$, we can write $n = c + ra + s$ where
$0 \le s < a$ and $r \ge M$. We then observe that, since $\sum_{\ell=1}^{m} b_\ell k_\ell = 1$ and
$\sum_{\ell=1}^{m} |b_\ell| k_\ell = M$, we have

$$n = (r - M) a + \sum_{\ell=1}^{m} \big(a |b_\ell| + s b_\ell\big) k_\ell + c,$$

where the quantities in brackets are non-negative. Hence, recalling that
$a, k_\ell \in T$, and that $p_{ij}^{(c)} > 0$, we have that

$$p_{ij}^{(n)} \ge \big(p_{ii}^{(a)}\big)^{r - M} \prod_{\ell=1}^{m} \big(p_{ii}^{(k_\ell)}\big)^{a |b_\ell| + s b_\ell} \, p_{ij}^{(c)} > 0,$$

as required. ∎

We are now able to prove our main Markov chain convergence theorem.


Theorem 8.3.10. If a Markov chain is irreducible and aperiodic, and
has a stationary distribution $\{\pi_i\}$, then for all states $i$ and $j$, we have
$\lim_{n \to \infty} P_i(X_n = j) = \pi_j$.


Proof. The proof uses the method of “coupling”. Let the original Markov
chain have state space $S$, and transition probabilities $\{p_{ij}\}$. We define a new
Markov chain $\{(X_n, Y_n)\}_{n=0}^{\infty}$, having state space $\bar{S} = S \times S$, and transition
probabilities given by $\bar{p}_{(ij),(k\ell)} = p_{ik} \, p_{j\ell}$. That is, our new Markov chain
has two coordinates, each of which is an independent copy of the original
Markov chain.

It follows immediately that the distribution on $\bar{S}$ given by $\bar{\pi}_{(ij)} = \pi_i \pi_j$
(i.e., a product of two probabilities in the original stationary distribution)
is in fact a stationary distribution for our new Markov chain. Furthermore,
from Lemma 8.3.9 above, we see that our new Markov chain will again
be irreducible and aperiodic (indeed, we have $\bar{p}_{(ij),(k\ell)}^{(n)} > 0$ whenever
$n \ge \max(n_0(i, k), n_0(j, \ell))$). Hence, from Lemma 8.3.8, we see that our new
Markov chain is in fact recurrent. This is the key to what follows.

To complete the proof, we choose $i_0 \in S$, and set $\tau = \inf\{n \ge 1;\ X_n = Y_n = i_0\}$.
Note that, for $m \le n$, the quantities $P_{(ij)}(\tau = m, X_n = k)$
and $P_{(ij)}(\tau = m, Y_n = k)$ are equal; indeed, they both equal
$P_{(ij)}(\tau = m) \, p_{i_0 k}^{(n-m)}$. Hence, for any states $i$, $j$, and $k$, we have that

$$\begin{aligned}
|p_{ik}^{(n)} - p_{jk}^{(n)}| &= |P_{(ij)}(X_n = k) - P_{(ij)}(Y_n = k)| \\
&= \Big| \sum_{m=1}^{n} P_{(ij)}(X_n = k,\ \tau = m) + P_{(ij)}(X_n = k,\ \tau > n) \\
&\qquad - \sum_{m=1}^{n} P_{(ij)}(Y_n = k,\ \tau = m) - P_{(ij)}(Y_n = k,\ \tau > n) \Big| \\
&= |P_{(ij)}(X_n = k,\ \tau > n) - P_{(ij)}(Y_n = k,\ \tau > n)| \\
&\le P_{(ij)}(\tau > n)
\end{aligned}$$

(where the inequality follows since we are considering a difference of two
positive quantities, each of which is $\le P_{(ij)}(\tau > n)$). Now, since the new
Markov chain is irreducible and recurrent, it follows from Theorem 8.2.3
that $\bar{f}_{(ij),(i_0 i_0)} = 1$. That is, with probability 1, the chain will eventually
hit the state $(i_0, i_0)$, in which case $\tau < \infty$. Hence, as $n \to \infty$, we have
$P_{(ij)}(\tau > n) \to 0$, so that $|p_{ik}^{(n)} - p_{jk}^{(n)}| \to 0$.


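On the other hand, by stationarity, we have for the original Markov
chain that

$$|p_{ij}^{(n)} - \pi_j| = \Big| \sum_{k \in S} \pi_k \big(p_{ij}^{(n)} - p_{kj}^{(n)}\big) \Big| \le \sum_{k \in S} \pi_k \, |p_{ij}^{(n)} - p_{kj}^{(n)}|,$$

and we see (again, by the M-test) that this converges to 0 as $n \to \infty$. Since
$p_{ij}^{(n)} = P_i(X_n = j)$, this proves the theorem. ∎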
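The convergence asserted by Theorem 8.3.10 can be observed numerically on the random walk on $\mathbf{Z}/(d)$ of Example 8.0.3, which is irreducible, aperiodic (it can stay in place), and has the uniform stationary distribution. The sketch below assumes $d = 5$ and uses hypothetical helper names; it illustrates the statement and is unrelated to the coupling argument itself.

```python
from fractions import Fraction

def circle_walk(d):
    """Transition matrix of Example 8.0.3: stay, or step +/-1 mod d, each w.p. 1/3."""
    return [[Fraction(1, 3) if (j - i) % d in (0, 1, d - 1) else Fraction(0)
             for j in range(d)] for i in range(d)]

def distance_to_uniform(d, n):
    """max_j |p_0j^(n) - 1/d| for the walk started at state 0."""
    p = circle_walk(d)
    row = [Fraction(int(j == 0)) for j in range(d)]   # point mass at state 0
    for _ in range(n):
        row = [sum(row[i] * p[i][j] for i in range(d)) for j in range(d)]
    return float(max(abs(x - Fraction(1, d)) for x in row))

gaps = [distance_to_uniform(5, n) for n in (1, 5, 20)]
```

The gap shrinks rapidly with $n$, consistent with $\lim_{n \to \infty} P_0(X_n = j) = 1/5$ for every $j$.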


Finally, we note that Theorem 8.3.10 remains true if the chain begins in
some other initial distribution $\{\nu_i\}$ besides those with $\nu_i = 1$ for some $i$:


Corollary 8.3.11. If a Markov chain is irreducible and aperiodic, and
has a stationary distribution $\{\pi_i\}$, then regardless of the initial distribution,
for all $j \in S$, we have $\lim_{n \to \infty} P(X_n = j) = \pi_j$.

Proof. Using the M-test and Theorem 8.3.10, we have

$$\begin{aligned}
\lim_{n \to \infty} P(X_n = j) &= \lim_{n \to \infty} \sum_{i \in S} P(X_0 = i,\ X_n = j) \\
&= \lim_{n \to \infty} \sum_{i \in S} \nu_i \, P_i(X_n = j) \\
&= \sum_{i \in S} \nu_i \lim_{n \to \infty} P_i(X_n = j) \\
&= \sum_{i \in S} \nu_i \, \pi_j \\
&= \pi_j,
\end{aligned}$$

which gives the result. ∎


8.4. Existence of stationary distributions. 


Theorem 8.3.10 gives powerful information about the convergence of 
irreducible and aperiodic Markov chains. However, this theorem requires 
the existence of a stationary distribution $\{\pi_i\}$. It is reasonable to ask for
conditions under which a stationary distribution will exist. We consider 
that question here. 

Given a Markov chain $\{X_n\}$ on a state space $S$, and given a state $i \in S$,
we define the mean return time $m_i$ by $m_i = E_i(\inf\{n \ge 1;\ X_n = i\})$. That
is, $m_i$ is the expected time to return to the state $i$. We always have $m_i \ge 1$.
If $i$ is transient, then with positive probability we will have
$\inf\{n \ge 1;\ X_n = i\} = \infty$, so of course we will have $m_i = \infty$. On the other hand, if $i$ is
recurrent, then we shall call $i$ null recurrent if $m_i = \infty$, and shall call $i$
positive recurrent if $m_i < \infty$. (The names come from considering $1/m_i$
rather than $m_i$.)

The main theorem of this subsection is 


Theorem 8.4.1. If a Markov chain is irreducible, and if each state $i$ of
the Markov chain is positive recurrent with (finite) mean return time $m_i$,
then the Markov chain has a unique stationary distribution $\{\pi_i\}$, given by
$\pi_i = 1/m_i$ for each state $i$.


This theorem is rather surprising. It is not immediately clear that the 
mean return times $m_i$ have anything to do with a stationary distribution; it
is even less expected that they provide a precise formula for the stationary 
distribution values. The proof of the theorem relies heavily on the following 
lemma. 


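Lemma 8.4.2. Let $G_n(i, j) = E_i(\#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}) = \sum_{\ell=1}^{n} p_{ij}^{(\ell)}$.
Then for an irreducible recurrent Markov chain,

$$\lim_{n \to \infty} \frac{G_n(i, j)}{n} = \frac{1}{m_j}$$

for any states $i$ and $j$.

Proof. Let $T_j^r$ be the time of the $r$-th hit of the state $j$. Then

$$T_j^r = T_j^1 + (T_j^2 - T_j^1) + \ldots + (T_j^r - T_j^{r-1});$$

here the $r - 1$ terms in brackets are i.i.d. and have mean $m_j$. Hence, from
the Strong Law of Large Numbers (second version, Theorem 5.4.4), we have
that $\lim_{r \to \infty} T_j^r / r = m_j$ with probability 1.

Now, for $n \in \mathbf{N}$, let $r(n) = \#\{\ell;\ 1 \le \ell \le n,\ X_\ell = j\}$. Then
$\lim_{n \to \infty} r(n) = \infty$ with probability 1 by recurrence. Also clearly
$T_j^{r(n)} \le n < T_j^{r(n)+1}$, so that

$$\frac{T_j^{r(n)}}{r(n)} \le \frac{n}{r(n)} < \frac{T_j^{r(n)+1}}{r(n)}.$$

Hence, we must have $\lim_{n \to \infty} n / r(n) = m_j$ with probability 1 as well.
Therefore, with probability 1 we have $\lim_{n \to \infty} r(n)/n = 1/m_j$.

On the other hand, $G_n(i, j) = E_i(r(n))$, and furthermore $0 \le r(n)/n \le 1$.
Hence, by the bounded convergence theorem,

$$\lim_{n \to \infty} \frac{G_n(i, j)}{n} = \lim_{n \to \infty} E_i\Big(\frac{r(n)}{n}\Big) = E_i\Big(\lim_{n \to \infty} \frac{r(n)}{n}\Big) = \frac{1}{m_j},$$

as claimed. ∎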


Remark 8.4.3. We note that this proof goes through without change in
the case $m_j = \infty$, so that $\lim_{n \to \infty} G_n(i, j)/n = 0$ in that case.


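Proof of Theorem 8.4.1. We begin by proving the uniqueness. Indeed,
suppose that $\sum_i \alpha_i \, p_{ij} = \alpha_j$ for all states $j$, for some probability distribution
$\{\alpha_i\}$. Then, by induction, $\sum_i \alpha_i \, p_{ij}^{(t)} = \alpha_j$ for $t = 1, 2, 3, \ldots$, so that also
$\frac{1}{n} \sum_{t=1}^{n} \sum_i \alpha_i \, p_{ij}^{(t)} = \alpha_j$. Hence,

$$\begin{aligned}
\alpha_j &= \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \sum_i \alpha_i \, p_{ij}^{(t)} \\
&= \sum_i \alpha_i \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} p_{ij}^{(t)} \qquad \text{by the M-test} \\
&= \sum_i \alpha_i \, \frac{1}{m_j} \qquad \text{by the Lemma} \\
&= \frac{1}{m_j}.
\end{aligned}$$

That is, $\alpha_j = 1/m_j$ is the only possible stationary distribution.

Now, fix any state $z$. Then for all $t \in \mathbf{N}$, we have $\sum_j p_{zj}^{(t)} = 1$. Hence,
if we set $C = \sum_j 1/m_j$, then using the Lemma and (A.4.9), we have

$$C = \sum_j \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} p_{zj}^{(t)} \le \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \sum_j p_{zj}^{(t)} = 1.$$

In particular, $C < \infty$.

Again fix any state $z$. Then for all states $j$, and for all $t \in \mathbf{N}$, we have
$\sum_i p_{zi}^{(t)} \, p_{ij} = p_{zj}^{(t+1)}$. Hence, again by the Lemma and (A.4.9), we have

$$\frac{1}{m_j} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} p_{zj}^{(t+1)} = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \sum_i p_{zi}^{(t)} \, p_{ij} \ge \sum_i \frac{1}{m_i} \, p_{ij}. \qquad (8.4.4)$$

But then summing both sides over all states $j$, and recalling that $\sum_j 1/m_j =
C < \infty$, we see that we must have equality in (8.4.4), i.e. we must have
$\frac{1}{m_j} = \sum_i \frac{1}{m_i} \, p_{ij}$. But this means that the probability distribution defined
by $\pi_i = \frac{1}{C m_i}$ must be a stationary distribution for the Markov chain!
Finally, by uniqueness, we must have $\pi_j = 1/m_j$ for each state $j$, i.e. we
must have $C = 1$. This completes the proof of the theorem. ∎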
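The identity $\pi_i = 1/m_i$ can be checked numerically on a small chain. The sketch below computes $m_i = \sum_{n \ge 0} P_i(\tau_i > n)$, where $\tau_i$ is the first return time, by propagating the mass that has not yet returned to $i$. The two-state chain with $p_{12} = 0.3$ and $p_{21} = 0.6$ is a hypothetical example, as are the function name `mean_return_time` and the truncation tolerance.

```python
def mean_return_time(p, i, tol=1e-12, max_iter=1_000_000):
    """Approximate m_i = sum over n >= 0 of P_i(first return to i occurs after time n)."""
    size = len(p)
    u = [1.0 if k == i else 0.0 for k in range(size)]   # mass not yet returned
    total = 0.0
    for _ in range(max_iter):
        mass = sum(u)
        total += mass                 # adds P_i(tau_i > n) for the current n
        if mass < tol:
            break
        # Advance one step, discarding any mass that lands back on state i.
        u = [0.0 if j == i else sum(u[k] * p[k][j] for k in range(size))
             for j in range(size)]
    return total

# Hypothetical two-state chain; its stationary distribution is (2/3, 1/3).
p = [[0.7, 0.3],
     [0.6, 0.4]]
m = [mean_return_time(p, 0), mean_return_time(p, 1)]
pi = [1 / m[0], 1 / m[1]]
```

For this chain a direct calculation gives $m_1 = 3/2$ and $m_2 = 3$, so $1/m_i$ recovers the stationary distribution $(2/3, 1/3)$, as the theorem predicts.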


On the other hand, states which are not positive recurrent cannot
contribute to a stationary distribution:


Proposition 8.4.5. If a Markov chain has a stationary distribution $\{\pi_i\}$,
and if a state $j$ is not positive recurrent (i.e., satisfies $m_j = \infty$), then we
must have $\pi_j = 0$.


Proof. We have that $\sum_i \pi_i \, p_{ij}^{(t)} = \pi_j$ for all $t \in \mathbf{N}$. Hence, using the
M-test and also Lemma 8.4.2 and Remark 8.4.3, we have

$$\pi_j = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} \sum_i \pi_i \, p_{ij}^{(t)} = \sum_i \pi_i \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} p_{ij}^{(t)} = \sum_i \pi_i \, (0) = 0,$$

as claimed. ∎

Corollary 8.4.6. If a Markov chain has no positive recurrent states, then
it does not have a stationary distribution.


Proof. Suppose to the contrary that it did have a stationary distribution
$\{\pi_i\}$. Then from the above proposition, we would necessarily have $\pi_j = 0$
for each state $j$, contradicting the fact that $\sum_j \pi_j = 1$. ∎


Theorem 8.4.1 and Corollary 8.4.6 provide clear information about Mar- 
kov chains where all states are positive recurrent, or where none are, re- 
spectively. One could still wonder about chains which have some positive 
recurrent and some non-positive-recurrent states. We now show that, for 
irreducible Markov chains, this cannot happen. The statement is somewhat 
analogous to that of Lemma 8.3.5. 


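Proposition 8.4.7. Let $i$ and $j$ be two states of a Markov chain. Suppose
that $f_{ij} > 0$ and $f_{ji} > 0$ (i.e., the states $i$ and $j$ communicate). Then if $i$ is
positive recurrent, then $j$ is also positive recurrent.


Proof. Find $r, t \in \mathbf{N}$ with $p_{ji}^{(r)} > 0$ and $p_{ij}^{(t)} > 0$. Then by Lemma 8.4.2
and Remark 8.4.3,

$$\frac{1}{m_j} = \lim_{n \to \infty} \frac{1}{n} \sum_{m=1}^{n} p_{jj}^{(m)} \ge \lim_{n \to \infty} \frac{1}{n} \sum_{m=r+t}^{n} p_{ji}^{(r)} \, p_{ii}^{(m-r-t)} \, p_{ij}^{(t)} = \frac{p_{ji}^{(r)} \, p_{ij}^{(t)}}{m_i} > 0,$$

so that $m_j < \infty$. ∎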


Corollary 8.4.8. For an irreducible Markov chain, either all states are 
positive recurrent or none are. 


Combining this corollary with Theorem 8.4.1 and Corollary 8.4.6, we see 
that 


Theorem 8.4.9. Consider an irreducible Markov chain. Then either (a)
all states are positive recurrent, and there exists a unique stationary
distribution, given by $\pi_j = 1/m_j$, to which (assuming aperiodicity) $P_i(X_n = j)$
converges as $n \to \infty$; or (b) no states are positive recurrent, and there does
not exist a stationary distribution.


For example, consider simple symmetric random walk, with state space
the integers $\mathbf{Z}$. It is clear that $m_j$ must be the same for all states $j$ (i.e.
no state is any “different” from any other state). Hence, it is impossible
that $\sum_{j \in \mathbf{Z}} 1/m_j = 1$. Thus, simple symmetric random walk cannot fall
into category (a) above. Hence, simple symmetric random walk falls into
category (b), and does not have a stationary distribution; in fact, it is null
recurrent.

Finally, we observe that all irreducible Markov chains on finite state
spaces necessarily fall into category (a) above:

Proposition 8.4.10. For an irreducible Markov chain on a finite state
space, all states are positive recurrent (and hence a unique stationary
distribution exists).


Proof. Fix a state $i$. Write $h_{ji}^{(m)} = P_j(X_t = i \text{ for some } 1 \le t \le m) =
\sum_{t=1}^{m} f_{ji}^{(t)}$. Then $\lim_{m \to \infty} h_{ji}^{(m)} = f_{ji} > 0$, for each state $j$. Hence, since
the state space is finite, we can find $m \in \mathbf{N}$ and $\delta > 0$ such that $h_{ji}^{(m)} \ge \delta$
for all states $j$.

But then we must have $1 - h_{ii}^{(n)} \le (1 - \delta)^{\lfloor n/m \rfloor}$, so that letting $\tau_i =
\inf\{n \ge 1;\ X_n = i\}$, we have by Proposition 4.2.9 that

$$m_i = \sum_{n=0}^{\infty} P_i(\tau_i \ge n + 1) = \sum_{n=0}^{\infty} \big(1 - h_{ii}^{(n)}\big) \le \sum_{n=0}^{\infty} (1 - \delta)^{\lfloor n/m \rfloor} = \frac{m}{\delta} < \infty. \quad ∎$$


n=0 


8.5. Exercises. 


Exercise 8.5.1. Consider a discrete-time, time-homogeneous Markov
chain with state space $S = \{1, 2\}$, and transition probabilities given by

$$p_{11} = a, \quad p_{12} = 1 - a, \quad p_{21} = 1, \quad p_{22} = 0.$$

For each $0 < a < 1$,

(a) Compute $p_{ij}^{(n)} = P(X_n = j \mid X_0 = i)$ for each $i, j \in S$ and $n \in \mathbf{N}$.

(b) Classify each state as recurrent or transient.

(c) Find all stationary distributions for this Markov chain.


Exercise 8.5.2. For any $\epsilon > 0$, give an example of an irreducible Markov
chain on a countably infinite state space, such that $|p_{ij} - p_{ik}| < \epsilon$ for all
states $i$, $j$, and $k$.


Exercise 8.5.3. For an arbitrary Markov chain on a state space consisting
of exactly $d$ states, find (with proof) the largest possible positive integer $N$
such that for some states $i$ and $j$, we have $p_{ij}^{(N)} > 0$ but $p_{ij}^{(n)} = 0$ for all
$n < N$.


Exercise 8.5.4. Given Markov chain transition probabilities $\{p_{ij}\}_{i,j \in S}$
on a state space $S$, call a subset $C \subseteq S$ closed if $\sum_{j \in C} p_{ij} = 1$ for each
$i \in C$. Prove that a Markov chain is irreducible if and only if it has no
closed subsets (aside from the empty set and $S$ itself).


Exercise 8.5.5. Suppose we modify Example 8.0.3 so the chain moves
one unit clockwise with probability $r$, or one unit counter-clockwise with
probability $1 - r$, for some $0 < r < 1$. That is, $S = \{0, 1, 2, \ldots, d-1\}$ and

$$p_{ij} = \begin{cases} r, & j \equiv i + 1 \ (\mathrm{mod}\ d) \\ 1 - r, & j \equiv i - 1 \ (\mathrm{mod}\ d) \\ 0, & \text{otherwise}. \end{cases}$$

Find (with explanation) all stationary distributions of this Markov chain.


Exercise 8.5.6. Consider the Markov chain with state space $S = \{1, 2, 3\}$
and transition probabilities $p_{12} = p_{23} = p_{31} = 1$. Let $\pi_1 = \pi_2 = \pi_3 = 1/3$.

(a) Determine whether or not the chain is irreducible.

(b) Determine whether or not the chain is aperiodic.

(c) Determine whether or not the chain is reversible with respect to $\{\pi_i\}$.

(d) Determine whether or not $\{\pi_i\}$ is a stationary distribution.


(e) Determine whether or not limn—soo ps) = 7}. 


Exercise 8.5.7. Give an example of an irreducible Markov chain, and
two distinct states $i$ and $j$, such that $f_{ij}^{(n)} > 0$ for all $n \in \mathbf{N}$, and such that
$f_{ij}^{(n)}$ is not a decreasing function of $n$ (i.e. for some $n \in \mathbf{N}$, $f_{ij}^{(n)} < f_{ij}^{(n+1)}$).


Exercise 8.5.8. Prove the identity $f_{ij} = p_{ij} + \sum_{k \neq j} p_{ik} \, f_{kj}$. [Hint:
Condition on $X_1$.]

Exercise 8.5.9. For each of the following transition probability matri- 
ces, determine which states are recurrent and which are transient, and also 


compute fi for each i. 
1/2 1/2 0 0 


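$$\text{(b)} \quad \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & 0 \\
1/2 & 0 & 1/2 & 0 & 0 & 0 & 0 \\
0 & 1/5 & 4/5 & 0 & 0 & 0 & 0 \\
0 & 0 & 1/3 & 1/3 & 1/3 & 0 & 0 \\
1/10 & 0 & 0 & 0 & 7/10 & 0 & 1/5 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0
\end{pmatrix}$$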

Exercise 8.5.10. Consider a Markov chain (not necessarily irreducible)
on a finite state space.

(a) Prove that at least one state must be recurrent.

(b) Give an example where exactly one state is recurrent (and all the rest 
are transient). 

(c) Show by example that if the state space is countably infinite then part 
(a) is no longer true. 


Exercise 8.5.11. For asymmetric one-dimensional simple random walk
(i.e. where $P(X_n = X_{n-1} + 1) = p = 1 - P(X_n = X_{n-1} - 1)$ for some
$p \ne 1/2$), provide an asymptotic upper bound for $\sum_{n=N}^{\infty} p_{ii}^{(n)}$. That is, find
an explicit expression $\gamma_N$, with $\lim_{N \to \infty} \gamma_N = 0$, such that
$\sum_{n=N}^{\infty} p_{ii}^{(n)} \le \gamma_N$ for all sufficiently large $N$.


Exercise 8.5.12. Let P = (pij) be the matrix of transition probabilities 
for a Markov chain on a finite state space. 

(a) Prove that P always has 1 as an eigenvalue. [Hint: Recall that the 
eigenvalues of P are the same whether it acts on row vectors to the left or 
on column vectors to the right.] 

(b) Suppose that v is a row eigenvector for P corresponding to the eigen- 
value 1, so that vP = v. Does v necessarily correspond to a stationary 
distribution? Why or why not? 


Exercise 8.5.13. Call a Markov chain doubly stochastic if its transition
matrix $\{p_{ij}\}_{i,j \in S}$ has the property that $\sum_{i \in S} p_{ij} = 1$ for each $j \in S$.
Prove that, for a doubly stochastic Markov chain on a finite state space
$S$, the uniform distribution (i.e. $\pi_i = 1/|S|$ for each $i \in S$) is a stationary
distribution.
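
As a numerical sanity check of the claim in Exercise 8.5.13 (not a proof), the following Python sketch takes one hypothetical 3-state doubly stochastic matrix `P`, chosen here purely for illustration and not taken from the text, and confirms in exact rational arithmetic that the uniform row vector is unchanged by one step of the chain.

```python
from fractions import Fraction as F

# Hypothetical doubly stochastic matrix (illustrative only, not from the text):
# every row and every column sums to 1.
P = [
    [F(1, 2), F(1, 3), F(1, 6)],
    [F(1, 6), F(1, 2), F(1, 3)],
    [F(1, 3), F(1, 6), F(1, 2)],
]

def step(pi, P):
    """Apply one step to a row vector: (pi P)_j = sum_i pi_i * p_ij."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

n = len(P)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(n)) for j in range(n)]
uniform = [F(1, n)] * n
after_one_step = step(uniform, P)   # should equal `uniform` if it is stationary
```

The check only covers this one matrix; the exercise asks for an argument valid for every doubly stochastic chain on a finite state space.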


Exercise 8.5.14. Give an example of a Markov chain on a finite state 
space, such that three of the states each have a different period. 


Exercise 8.5.15. Consider Ehrenfest’s Urn (Example 8.0.4).
(a) Compute $P_0(X_n = 0)$ for $n$ odd.

(b) Prove that $P_0(X_n = 0) \not\to 2^{-d}$ as $n \to \infty$.

(c) Why does this not contradict Theorem 8.3.10?


Exercise 8.5.16. Consider the Markov chain with state space $S = \{1, 2, 3\}$
and transition probabilities given by

$$(p_{ij}) = \begin{pmatrix} 0 & 2/3 & 1/3 \\ 1/4 & 0 & 3/4 \\ 4/5 & 1/5 & 0 \end{pmatrix}.$$

(a) Find an explicit formula for $P_1(\tau_1 = n)$ for each $n \in \mathbf{N}$, where
$\tau_1 = \inf\{n \ge 1;\ X_n = 1\}$.

(b) Compute the mean return time $m_1 = E_1(\tau_1)$.

(c) Prove that this Markov chain has a unique stationary distribution, to
be called $\{\pi_i\}$.

(d) Compute the stationary probability $\pi_1$.


Exercise 8.5.17. Give an example of a Markov chain for which some
states are positive recurrent, some states are null recurrent, and some states 
are transient. 


Exercise 8.5.18. Prove that if $f_{ij} > 0$ and $f_{ji} = 0$, then $i$ is transient.


Exercise 8.5.19. Prove that for a Markov chain on a finite state space, 
no states are null recurrent. [Hint: The previous exercise provides a starting
point.]


Exercise 8.5.20. (a) Give an example of a Markov chain on a finite 
state space which has multiple (i.e. two or more) stationary distributions. 
(b) Give an example of a reducible Markov chain on a finite state space, 
which nevertheless has a unique stationary distribution. 

(c) Suppose that a Markov chain on a finite state space is decomposable,
meaning that the state space can be partitioned as $S = S_1 \,\dot\cup\, S_2$, with $S_i$
non-empty, such that $f_{ij} = f_{ji} = 0$ whenever $i \in S_1$ and $j \in S_2$. Prove that
the chain has multiple stationary distributions.

(d) Prove that for a Markov chain as in part (b), some states are transient.
[Hint: Exercise 8.5.18 may help.]


8.6. Section summary. 


This section gave a lengthy exploration of Markov chains (in discrete 
time and space). It gave several examples of Markov chains, and proved 
their existence. It defined the important concepts of transience and recur- 
rence, and proved a number of equivalences of them. For an irreducible 
Markov chain, either all states are transient or all states are recurrent. 

The section then discussed stationary distributions. It proved that the 
transition probabilities of an irreducible, aperiodic Markov chain will con- 
verge to the stationary distribution, regardless of the starting point. Finally, 
it related stationary distributions to the mean return times of the chain’s 
states, proving that (for an irreducible chain) a stationary distribution exists 
if and only if these mean return times are all finite.


9. More probability theorems.


In this section, we discuss a few probability ideas that we have not 
needed so far, but which will be important for the more advanced material 
to come. 


9.1. Limit theorems. 


Suppose $X, X_1, X_2, \ldots$ are random variables defined on some probability
triple $(\Omega, \mathcal{F}, P)$. Suppose further that $\lim_{n \to \infty} X_n(\omega) = X(\omega)$ for
each fixed $\omega \in \Omega$ (or at least for all $\omega$ outside a set of probability 0), i.e.
that $\{X_n\}$ converges to $X$ almost surely. Does it necessarily follow that
$\lim_{n \to \infty} E(X_n) = E(X)$?

We already know that this is not the case in general. Indeed, it was not
the case for the “double ’til you win” gambling strategy of Subsection 7.3
(where if $0 < p \le 1/2$, then $E(X_n) \le a$ for all $n$, even though $E(\lim X_n) = a+1$).
Or for a simple counter-example, let $\Omega = \mathbf{N}$, with $P(\omega) = 2^{-\omega}$
for $\omega \in \Omega$, and let $X_n(\omega) = 2^n \delta_{\omega,n}$ (i.e., $X_n(\omega) = 2^n$ if $\omega = n$, and
equals 0 otherwise). Then $\{X_n\}$ converges to 0 with probability 1, but
$E(X_n) = 1 \not\to 0$ as $n \to \infty$.
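
The simple counter-example can be checked directly. The sketch below (function names are illustrative, not from the text) encodes $P(\omega) = 2^{-\omega}$ and $X_n(\omega) = 2^n \delta_{\omega,n}$ with exact fractions, computes $E(X_n)$ for the first few $n$, and lists the values $X_n(\omega)$ at a fixed $\omega$ for all $n > \omega$.

```python
from fractions import Fraction

def prob(w):
    """P(w) = 2^(-w) for w = 1, 2, 3, ..."""
    return Fraction(1, 2 ** w)

def x(n, w):
    """X_n(w) = 2^n if w == n, and 0 otherwise."""
    return 2 ** n if w == n else 0

def expectation_x(n):
    """E(X_n): only the single point w = n contributes to the sum."""
    return x(n, n) * prob(n)

def pointwise_tail(w, n_max):
    """The values X_n(w) for n = w+1, ..., n_max at one fixed w."""
    return [x(n, w) for n in range(w + 1, n_max + 1)]

expectations = [expectation_x(n) for n in range(1, 21)]
tail_at_5 = pointwise_tail(5, 50)
```

Every entry of `expectations` equals 1, while at each fixed $\omega$ the sequence $X_n(\omega)$ is 0 as soon as $n > \omega$, which is exactly the gap between pointwise convergence and convergence of expectations.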

On the other hand, we already have two results giving conditions under
which it is true that $E(X_n) \to E(X)$, namely the Monotone Convergence
Theorem (Theorem 4.2.2) and the Bounded Convergence Theorem (Theo- 
rem 7.3.1). Such limit theorems are sometimes very helpful. 

In this section, we shall establish two more similar limit theorems, 
namely the Dominated Convergence Theorem and the Uniformly Integrable 
Convergence Theorem. A first key result is 


Theorem 9.1.1. (Fatou’s Lemma) If $X_n \ge C$ for all $n$, and some constant
$C > -\infty$, then

$$E\left(\liminf_{n \to \infty} X_n\right) \le \liminf_{n \to \infty} E(X_n).$$

(We allow the possibility that both sides are infinite.)


Proof. Set $Y_n = \inf_{k \ge n} X_k$, and let $Y = \lim_{n \to \infty} Y_n = \liminf_{n \to \infty} X_n$.
Then $Y_n \ge C$ and $\{Y_n\} \nearrow Y$, and furthermore $Y_n \le X_n$. Hence, by the
order-preserving property and the monotone convergence theorem, we have
$\liminf_n E(X_n) \ge \liminf_n E(Y_n) = E(Y)$, as claimed. ∎


Remarks.
1. In this theorem, “$\liminf_{n \to \infty} X_n$” is of course interpreted pointwise; that
is, its value at $\omega$ is $\liminf_{n \to \infty} X_n(\omega)$.

2. For the “simple counter-example” mentioned above, it is easily verified
that $E(\liminf_{n \to \infty} X_n) = 0$ while $\liminf_{n \to \infty} E(X_n) = 1$, so the theorem
gives $0 \le 1$ which is true. If we replace $X_n$ by $-X_n$ in that example,
we instead obtain $0 \le -1$ which is false; however, in that case the $\{X_n\}$
are not bounded below.

3. For the “double ’til you win” gambling strategy of Subsection 7.3, the
fortunes $\{X_n\}$ are not bounded below. However, they are bounded above
(by $a + 1$), so their negatives are bounded below. When the theorem is
applied to their negatives, it gives that $-(a + 1) \le \liminf E(-X_n)$, so
that $\limsup E(X_n) \le a + 1$, which is certainly true since $X_n \le a + 1$ for
all $n$. (In fact, if $p \le 1/2$, then $\limsup E(X_n) \le a$.)


It is now straightforward to prove 


Theorem 9.1.2. (The Dominated Convergence Theorem) If $X, X_1, X_2, \ldots$
are random variables, and if $\{X_n\} \to X$ with probability 1, and if there is
a random variable $Y$ with $|X_n| \le Y$ for all $n$ and with $E(Y) < \infty$, then
$\lim_{n \to \infty} E(X_n) = E(X)$.


Proof. We note that $Y + X_n \ge 0$. Hence, applying Fatou’s Lemma to
$\{Y + X_n\}$, we obtain that

$$E(Y) + E(X) = E(Y + X) \le \liminf_n E(Y + X_n) = E(Y) + \liminf_n E(X_n).$$

Hence, canceling the $E(Y)$ terms (which is where we use the fact that
$E(Y) < \infty$), we see that $E(X) \le \liminf_n E(X_n)$.

Similarly, $Y - X_n \ge 0$, and applying Fatou’s Lemma to $\{Y - X_n\}$,
we obtain that $E(Y) - E(X) \le E(Y) + \liminf_n E(-X_n) = E(Y) - \limsup_n E(X_n)$,
so that $E(X) \ge \limsup_n E(X_n)$.

But we always have $\limsup_n E(X_n) \ge \liminf_n E(X_n)$. Hence, we must
have that $\limsup_n E(X_n) = \liminf_n E(X_n) = E(X)$, as claimed. ∎


Remark. Of course, if the random variable Y is constant, then the dom- 
inated convergence theorem reduces to the bounded convergence theorem. 


For our second new limit theorem, we need a definition. Note that
for any random variable $X$ with $E(|X|) < \infty$ (i.e., with $X$ “integrable”),
by the dominated convergence theorem $\lim_{\alpha \to \infty} E(|X| 1_{|X| \ge \alpha}) = E(0) = 0$.
Taking a supremum over $n$ makes a collection of random variables uniformly
integrable:


Definition 9.1.3. A collection $\{X_n\}$ of random variables is uniformly
integrable if

$$\lim_{\alpha \to \infty} \sup_n E\left(|X_n| 1_{|X_n| \ge \alpha}\right) = 0. \qquad (9.1.4)$$
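
To make the uniform integrability condition concrete, here is a small Python sketch computing $\sup_n E(|X_n| 1_{|X_n| \ge \alpha})$ on $\Omega = \mathbf{N}$ with $P(\omega) = 2^{-\omega}$ for two families of the form $X_n(\omega) = h(n)\,\delta_{\omega,n}$. The family with $h(n) = 2^n$ is the simple counter-example of Subsection 9.1; the family with $h(n) = n$ is a hypothetical contrast introduced here. The supremum is truncated at `n_max`, which is an assumption of the sketch rather than part of the definition.

```python
from fractions import Fraction

def tail_expectation(height, n, alpha):
    """E(|X_n| 1_{|X_n| >= alpha}) when X_n(w) = height(n) if w == n, else 0,
    on Omega = N with P(w) = 2^(-w)."""
    h = height(n)
    return Fraction(h, 2 ** n) if h >= alpha else Fraction(0)

def sup_tail(height, alpha, n_max=2000):
    """Supremum over n <= n_max (truncating at n_max is an assumption)."""
    return max(tail_expectation(height, n, alpha) for n in range(1, n_max + 1))

def doubling(n):
    """X_n = 2^n at w = n: the simple counter-example."""
    return 2 ** n

def linear(n):
    """X_n = n at w = n: a hypothetical contrasting family."""
    return n

alphas = [1, 10, 100, 1000]
doubling_tails = [sup_tail(doubling, a) for a in alphas]
linear_tails = [sup_tail(linear, a) for a in alphas]
```

For the doubling family the supremum stays at 1 for every $\alpha$, so the limit in the definition is not 0; for the linear family it shrinks rapidly with $\alpha$, consistent with uniform integrability.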


Uniform integrability immediately implies boundedness of certain ex- 
pectations: 


Proposition 9.1.5. If $\{X_n\}$ is uniformly integrable, then $\sup_n E(|X_n|) < \infty$.
Furthermore, if also $\{X_n\} \to X$ a.s., then $E|X| < \infty$.


Proof. Choose $\alpha_1$ so that (say) $\sup_n E(|X_n| 1_{|X_n| \ge \alpha_1}) < 1$. Then

$$\sup_n E(|X_n|) = \sup_n E\left(|X_n| 1_{|X_n| < \alpha_1} + |X_n| 1_{|X_n| \ge \alpha_1}\right) \le \alpha_1 + 1 < \infty.$$

It then follows from Fatou’s Lemma that, if $\{X_n\} \to X$ a.s., then
$E(|X|) \le \liminf_n E(|X_n|) \le \sup_n E(|X_n|) < \infty$. ∎


The main use of uniform integrability is given by: 


Theorem 9.1.6. (The Uniform Integrability Convergence Theorem) If
$X, X_1, X_2, \ldots$ are random variables, and if $\{X_n\} \to X$ with probability 1,
and if $\{X_n\}$ are uniformly integrable, then $\lim_{n \to \infty} E(X_n) = E(X)$.


Proof. Let $Y_n = |X_n - X|$, so that $Y_n \to 0$. We shall show that $E(Y_n) \to 0$;
it then follows from the triangle inequality that $|E(X_n) - E(X)| \le E(Y_n) \to 0$,
thus proving the theorem. We will consider $E(Y_n)$ in two pieces, using
that $Y_n = Y_n 1_{Y_n < \alpha} + Y_n 1_{Y_n \ge \alpha}$.

Let $Y_n^{(\alpha)} = Y_n 1_{Y_n < \alpha}$. Then for any fixed $\alpha > 0$, we have $|Y_n^{(\alpha)}| \le \alpha$,
and also $Y_n^{(\alpha)} \to 0$ as $n \to \infty$, so by the bounded convergence theorem we
have

$$\lim_{n \to \infty} E\left(Y_n^{(\alpha)}\right) = 0, \qquad \alpha > 0. \qquad (9.1.7)$$

For the second piece, we note by the triangle inequality (again) that
$Y_n \le |X_n| + |X| \le 2M_n$, where $M_n = \max(|X_n|, |X|)$. Hence, if $Y_n \ge \alpha$,
then we must have $M_n \ge \alpha/2$, and thus that either $|X_n| \ge \alpha/2$ or
$|X| \ge \alpha/2$. This implies that

$$Y_n 1_{Y_n \ge \alpha} \le 2 M_n 1_{M_n \ge \alpha/2} \le 2|X_n| 1_{|X_n| \ge \alpha/2} + 2|X| 1_{|X| \ge \alpha/2}.$$

Taking expectations and supremums gives

$$\sup_n E\left(Y_n 1_{Y_n \ge \alpha}\right) \le 2 \sup_n E\left(|X_n| 1_{|X_n| \ge \alpha/2}\right) + 2 E\left(|X| 1_{|X| \ge \alpha/2}\right).$$

Now, as $\alpha \to \infty$, the first of these two terms goes to 0 by the uniform
integrability assumption; and the second goes to 0 by the dominated
convergence theorem (since $E|X| < \infty$). We conclude that

$$\lim_{\alpha \to \infty} \sup_n E\left(Y_n 1_{Y_n \ge \alpha}\right) = 0. \qquad (9.1.8)$$

To finish the proof, let $\epsilon > 0$. By (9.1.8), we can find $\alpha_0 > 0$ such
that $\sup_n E(Y_n 1_{Y_n \ge \alpha_0}) < \epsilon/2$. By (9.1.7), we can find $n_0 \in \mathbf{N}$ such that
$E(Y_n^{(\alpha_0)}) < \epsilon/2$ for all $n \ge n_0$. Then, for any $n \ge n_0$, we have

$$E(Y_n) = E\left(Y_n^{(\alpha_0)}\right) + E\left(Y_n 1_{Y_n \ge \alpha_0}\right) < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon.$$

Hence, $E(Y_n) \to 0$, as desired. ∎


Remark 9.1.9. While the above limit theorems were all stated for a
countable limit of random variables, they apply equally well for continuous
limits. For example, suppose $\{X_t\}_{t \ge 0}$ is a continuous-parameter family
of random variables, and that $\lim_{t \searrow 0} X_t(\omega) = X_0(\omega)$ for each fixed
$\omega \in \Omega$. Suppose further that the family $\{X_t\}_{t \ge 0}$ is dominated (or uniformly
integrable) as in the previous limit theorems. Then for any countable sequence
of parameters $\{t_n\} \searrow 0$, the theorems say that $E(X_{t_n}) \to E(X_0)$. Since
this is true for any sequence $\{t_n\} \searrow 0$, it follows that $\lim_{t \searrow 0} E(X_t) = E(X_0)$
as well.


9.2. Differentiation of expectation. 


A classic question from multivariable calculus asks when we can “differentiate
under the integral sign”. For example, is it true that

$$\frac{d}{dt} \int_0^1 e^{st}\, ds = \int_0^1 \left[\frac{\partial}{\partial t} e^{st}\right] ds = \int_0^1 s\, e^{st}\, ds\,?$$

The answer is yes, as the following proposition shows. More generally, the
proposition considers a family of random variables $\{F_t\}$ (e.g. $F_t(\omega) = e^{\omega t}$),
and the derivative (with respect to $t$) of the function $E(F_t)$ (e.g. of $\int_0^1 e^{\omega t}\, d\omega$).


Proposition 9.2.1. Let $\{F_t\}_{a<t<b}$ be a collection of random variables
with finite expectations, defined on some probability triple $(\Omega, \mathcal{F}, P)$. Suppose
for each $\omega$ and each $a < t < b$, the derivative $F'_t(\omega) = \frac{\partial}{\partial t} F_t(\omega)$ exists.
Then $F'_t$ is a random variable. Suppose further that there is a random variable
$Y$ on $(\Omega, \mathcal{F}, P)$ with $E(Y) < \infty$, such that $|F'_t| \le Y$ for all $a < t < b$.
Then if we define $\phi(t) = E(F_t)$, then $\phi$ is differentiable, with finite derivative
$\phi'(t) = E(F'_t)$ for all $a < t < b$.


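Proof. To see that $F'_t$ is a random variable, let $t_n = t + \frac{1}{n}$. Then
$F'_t = \lim_{n \to \infty} n(F_{t_n} - F_t)$ and hence is the countable limit of random variables,
and therefore is itself a random variable by (3.1.6).

Next, note that (by the Mean Value Theorem) we always have
$\left|\frac{F_{t+h} - F_t}{h}\right| \le Y$. Hence, using the dominated convergence theorem
together with Remark 9.1.9, we have

$$\phi'(t) = \lim_{h \to 0} E\left(\frac{F_{t+h} - F_t}{h}\right) = E\left(\lim_{h \to 0} \frac{F_{t+h} - F_t}{h}\right) = E(F'_t),$$

and this is finite since $E(|F'_t|) \le E(Y) < \infty$. ∎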
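
The exchange of derivative and expectation can be checked numerically on the motivating example $F_t(\omega) = e^{\omega t}$ with $\omega$ uniform on $[0,1]$. The sketch below is only an illustration under stated assumptions: the midpoint rule, the step counts, and the finite-difference step `h` are choices made here, not part of the text. At $t = 1$ the exact value is $\int_0^1 s\,e^{s}\,ds = 1$.

```python
import math

def midpoint_integral(g, a=0.0, b=1.0, steps=2000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (k + 0.5) * h) for k in range(steps))

def phi(t):
    """phi(t) = E(F_t) = integral over [0, 1] of e^{s t} ds."""
    return midpoint_integral(lambda s: math.exp(s * t))

def phi_prime_by_difference(t, h=1e-5):
    """Central finite-difference approximation of phi'(t)."""
    return (phi(t + h) - phi(t - h)) / (2.0 * h)

def expectation_of_derivative(t):
    """E(F'_t) = integral over [0, 1] of s e^{s t} ds."""
    return midpoint_integral(lambda s: s * math.exp(s * t))

lhs = phi_prime_by_difference(1.0)
rhs = expectation_of_derivative(1.0)
```

Both `lhs` and `rhs` should agree with each other and with the exact value 1 up to discretisation error.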


9.3. Moment generating functions and large deviations. 


The moment generating function of a random variable $X$ is the function

$$M_X(s) = E(e^{sX}), \qquad s \in \mathbf{R}.$$

At first glance, it may appear that this function is of little use. However,
we shall see that a surprising amount of information about the distribution
of $X$ can be obtained from $M_X(s)$.

If $X$ and $Y$ are independent random variables, then $e^{sX}$ and $e^{sY}$ are
independent by Proposition 3.2.3, so we have by (4.2.7) that

$$M_{X+Y}(s) = M_X(s)\, M_Y(s), \qquad X, Y \text{ independent}. \qquad (9.3.1)$$


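Clearly, we always have $M_X(0) = 1$. However, we may have $M_X(s) = \infty$
for certain $s \ne 0$. For example, if $P[X = m] = c/m^2$ for all integers $m \ne 0$,
where $c = \left(\sum_{j \ne 0} j^{-2}\right)^{-1}$, then $M_X(s) = \infty$ for all $s \ne 0$. On the other
hand, if $X \sim N(0,1)$, then completing the square gives that

$$M_X(s) = \int_{-\infty}^{\infty} e^{sx}\, \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}\, dx = e^{s^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-(x-s)^2/2}\, dx = e^{s^2/2}(1) = e^{s^2/2}. \qquad (9.3.2)$$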
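
Formula (9.3.2) can be sanity-checked by numerical integration. The following sketch approximates $\int e^{sx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx$ with a midpoint rule on a truncated interval and compares it with $e^{s^2/2}$. The truncation `half_width` and the number of `steps` are assumptions of this illustration.

```python
import math

def normal_mgf_numeric(s, half_width=15.0, steps=6000):
    """Midpoint-rule approximation of E(e^{sX}) for X ~ N(0,1),
    integrating over [-half_width, half_width] only."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += math.exp(s * x - 0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)

def normal_mgf_closed_form(s):
    """The closed form e^{s^2/2} from (9.3.2)."""
    return math.exp(0.5 * s * s)

test_points = [-2.0, 0.0, 0.5, 1.0, 2.0]
gaps = [abs(normal_mgf_numeric(s) - normal_mgf_closed_form(s)) for s in test_points]
```

For moderate $s$ the two values should agree to many digits; for large $|s|$ the truncation interval would need to be widened.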


A key property of moment generating functions, at least when they are
finite in a neighbourhood of 0, is the following.


Theorem 9.3.3. Let $X$ be a random variable such that $M_X(s) < \infty$ for
$|s| < s_0$, for some $s_0 > 0$. Then $E|X^n| < \infty$ for all $n$, and $M_X(s)$ is analytic
for $|s| < s_0$ with

$$M_X(s) = \sum_{n=0}^{\infty} E(X^n)\, s^n / n!\,.$$

In particular, the $r^{\text{th}}$ derivative at $s = 0$ is given by $M_X^{(r)}(0) = E(X^r)$.
Proof. The idea of the proof is that

$$M_X(s) = E(e^{sX}) = E\left(1 + (sX) + (sX)^2/2! + \ldots\right) = 1 + s E(X) + \frac{s^2}{2!} E(X^2) + \ldots\,.$$

However, the final equality requires justification. For this, fix $s$ with $|s| < s_0$,
and let $Z_n = 1 + (sX) + (sX)^2/2! + \ldots + (sX)^n/n!$. We have to show that
$E(\lim_{n \to \infty} Z_n) = \lim_{n \to \infty} E(Z_n)$. Now, for all $n \in \mathbf{N}$,

$$|Z_n| \le 1 + |sX| + |sX|^2/2! + \ldots + |sX|^n/n! \le 1 + |sX| + |sX|^2/2! + \ldots = e^{|sX|} \le e^{sX} + e^{-sX} \equiv Y.$$

Since $|s| < s_0$, we have $E(Y) = M_X(s) + M_X(-s) < \infty$. Hence, by the
dominated convergence theorem, $E(\lim_{n \to \infty} Z_n) = \lim_{n \to \infty} E(Z_n)$. ∎


Remark. Theorem 9.3.3 says that the $r^{\text{th}}$ derivative of $M_X$ at 0 equals
the $r^{\text{th}}$ moment of $X$ (thus explaining the terminology “moment generating
function”). For example, $M_X(0) = 1$, $M'_X(0) = E(X)$, $M''_X(0) = E(X^2)$, etc.
This result also follows since $\left(\frac{d}{ds}\right)^r E(e^{sX}) = E\left(\left(\frac{\partial}{\partial s}\right)^r e^{sX}\right) = E(X^r e^{sX})$,
where the exchange of derivative and expectation can be justified
(for $|s| < s_0$) using Proposition 9.2.1 and induction.
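
As a small illustration of moments being read off from derivatives of $M_X$ at 0, the sketch below uses a fair six-sided die (a hypothetical example, not taken from the text), whose moment generating function is $\frac{1}{6}\sum_{k=1}^{6} e^{sk}$, and approximates $M'_X(0)$ and $M''_X(0)$ by finite differences. The step size `h` is an assumption of the sketch.

```python
import math

def mgf_die(s):
    """M_X(s) for X uniform on {1, ..., 6} (illustrative distribution)."""
    return sum(math.exp(s * k) for k in range(1, 7)) / 6.0

def first_derivative_at_zero(mgf, h=1e-4):
    """Central difference approximation of M'(0)."""
    return (mgf(h) - mgf(-h)) / (2.0 * h)

def second_derivative_at_zero(mgf, h=1e-4):
    """Central second difference approximation of M''(0)."""
    return (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)

exact_mean = sum(range(1, 7)) / 6.0                            # E(X)
exact_second_moment = sum(k * k for k in range(1, 7)) / 6.0    # E(X^2)
approx_mean = first_derivative_at_zero(mgf_die)
approx_second_moment = second_derivative_at_zero(mgf_die)
```

The finite-difference values should match $E(X)$ and $E(X^2)$ up to discretisation and rounding error.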


We now consider the subject of large deviations. If $X_1, X_2, \ldots$ are i.i.d.
with common mean $m$ and finite variance $v$, then it follows from Chebychev’s
inequality (as in the proof of the Weak Law of Large Numbers) that
for all $\epsilon > 0$, $P\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le v/(n\epsilon^2)$, which converges to 0 as
$n \to \infty$. But how quickly is this limit reached? Does the probability really
just decrease as $O(1/n)$, or does it decrease faster? In fact, if the moment
generating functions are finite in a neighbourhood of 0, then the convergence
is exponentially fast:


Theorem 9.3.4. Suppose $X_1, X_2, \ldots$ are i.i.d. with common mean $m$,
and with $M_{X_i}(s) < \infty$ for $-a < s < b$ where $a, b > 0$. Then

$$P\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) \le \rho^n, \qquad n \in \mathbf{N},$$

where $\rho = \inf_{0<s<b} \left(e^{-s(m+\epsilon)} M_{X_1}(s)\right) < 1$.


This theorem gives an exponentially small upper bound on the probability
that the average of the $X_i$ exceeds its mean by at least $\epsilon$. This is a (very
simple) example of a large deviations result, and shows that in this case the
probability is decreasing to zero exponentially quickly, much faster than
just $O(1/n)$.


To prove Theorem 9.3.4, we begin with a lemma: 


Lemma 9.3.5. Let $Z$ be a random variable with $E(Z) < 0$, such that
$M_Z(s) < \infty$ for $-a < s < b$, for some $a, b > 0$. Then $P(Z \ge 0) \le \rho$, where
$\rho = \inf_{0<s<b} M_Z(s) < 1$.


Proof. For any $0 < s < b$, since the function $x \mapsto e^{sx}$ is increasing,
Markov’s inequality implies that

$$P(Z \ge 0) = P(e^{sZ} \ge 1) \le \frac{E(e^{sZ})}{1} = M_Z(s).$$

Hence, taking the infimum over $0 < s < b$,

$$P(Z \ge 0) \le \inf_{0<s<b} M_Z(s) = \rho.$$

Furthermore, since $M_Z(0) = 1$ and $M'_Z(0) = E(Z) < 0$, we must have that
$M_Z(s) < 1$ for all positive $s$ sufficiently close to 0. In particular, $\rho < 1$. ∎


Remark. In Lemma 9.3.5, we need $a > 0$ to ensure that $M'_Z(0) = E(Z)$.
However, the precise value of $a$ is unimportant.


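Proof of Theorem 9.3.4. Let $Y_i = X_i - m - \epsilon$, so $E(Y_i) = -\epsilon < 0$. Then
for $-a < s < b$, $M_{Y_i}(s) = E(e^{sY_i}) = e^{-s(m+\epsilon)} E(e^{sX_i}) = e^{-s(m+\epsilon)} M_{X_i}(s) < \infty$.
Using Lemma 9.3.5 and (9.3.1), we have

$$P\left(\frac{X_1 + \ldots + X_n}{n} \ge m + \epsilon\right) = P\left(\frac{Y_1 + \ldots + Y_n}{n} \ge 0\right) = P(Y_1 + \ldots + Y_n \ge 0)$$
$$\le \inf_{0<s<b} M_{Y_1 + \ldots + Y_n}(s) = \inf_{0<s<b} \left(M_{Y_1}(s)\right)^n = \rho^n,$$

where $\rho = \inf_{0<s<b} M_{Y_1}(s) = \inf_{0<s<b} \left(e^{-s(m+\epsilon)} M_{X_1}(s)\right)$. Furthermore,
from Lemma 9.3.5, $\rho < 1$. ∎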
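
To see the bound of Theorem 9.3.4 in action, the sketch below takes $X_i \sim$ Bernoulli(1/2) (so $m = 1/2$ and $M_{X_1}(s) = (1 + e^s)/2$) with $\epsilon = 0.1$. This distribution, the grid search standing in for the infimum, and all function names are assumptions made for illustration. Since every $s > 0$ on the grid already gives a valid bound, the grid minimum is a legitimate (slightly conservative) value of $\rho$. The exact binomial tail is computed by direct summation for comparison.

```python
import math

MEAN = 0.5       # m = E(X_1) for X_1 ~ Bernoulli(1/2) (illustrative choice)
EPSILON = 0.1    # deviation size

def mgf(s):
    """M_{X_1}(s) = (1 + e^s) / 2 for Bernoulli(1/2)."""
    return 0.5 * (1.0 + math.exp(s))

def rho(s_max=5.0, grid=5000):
    """Grid-search stand-in for inf over s > 0 of e^{-s(m+eps)} M(s)."""
    candidates = (s_max * k / grid for k in range(1, grid))
    return min(math.exp(-s * (MEAN + EPSILON)) * mgf(s) for s in candidates)

def exact_tail(n):
    """P(S_n / n >= 0.6) for S_n ~ Binomial(n, 1/2).
    The threshold ceil(6n/10) is hard-coded to match MEAN + EPSILON = 0.6."""
    k_min = -(-6 * n // 10)
    return sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n

def chebyshev_bound(n):
    """v / (n eps^2) with variance v = 1/4."""
    return 0.25 / (n * EPSILON ** 2)

r = rho()
comparison = [(n, exact_tail(n), r ** n, chebyshev_bound(n)) for n in (10, 100, 1000)]
```

Each tuple in `comparison` lists $n$, the exact tail probability, the exponential bound $\rho^n$, and the Chebychev bound, so the difference between exponential and $O(1/n)$ decay can be inspected directly.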
9.4. Fubini’s Theorem and convolution.


In multivariable calculus, an iterated integral like $\int_0^1 \int_0^1 x^2 y^3\, dx\, dy$ can
be computed in three different ways: integrate first $x$ and then $y$, or integrate
first $y$ and then $x$, or compute a two-dimensional “double integral”
over the full two-dimensional region. It is well known that under mild
conditions, these three different integrals will all be equal.

A generalisation of this is Fubini’s Theorem, which allows us to compute 
expectations with respect to product measure in terms of an “iterated in- 
tegral”, where we integrate first with respect to one variable and then with 
respect to the other, in either order. (For non-negative f, the theorem is 
sometimes referred to as Tonelli’s theorem.) 


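Theorem 9.4.1. (Fubini’s Theorem) Let $\mu$ be a probability measure on
$\mathcal{X}$, and $\nu$ a probability measure on $\mathcal{Y}$, and let $\mu \times \nu$ be product measure on
$\mathcal{X} \times \mathcal{Y}$. If $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ is measurable with respect to $\mu \times \nu$, then

$$\int_{\mathcal{X} \times \mathcal{Y}} f\, d(\mu \times \nu) = \int_{\mathcal{X}} \left(\int_{\mathcal{Y}} f(x,y)\, \nu(dy)\right) \mu(dx) = \int_{\mathcal{Y}} \left(\int_{\mathcal{X}} f(x,y)\, \mu(dx)\right) \nu(dy), \qquad (9.4.2)$$

provided that either $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) < \infty$ or $\int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) < \infty$ (or
both). [This is guaranteed if, for example, $f \ge C > -\infty$, or $f \le C < \infty$,
or $\int_{\mathcal{X} \times \mathcal{Y}} |f|\, d(\mu \times \nu) < \infty$. Note that we allow that the inner integrals (i.e.,
the integrals inside the brackets) may be infinite or even undefined on a set
of probability 0.]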
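
In the simplest setting of finite sets, Fubini’s Theorem reduces to swapping the order of a double sum. The sketch below checks this with exact fractions for two small probability measures `mu` and `nu` and a function `f`, all of which are hypothetical choices made here for illustration.

```python
from fractions import Fraction as F

# Hypothetical probability measures on finite sets (illustrative only).
mu = {0: F(1, 2), 1: F(1, 3), 2: F(1, 6)}   # on X = {0, 1, 2}
nu = {0: F(1, 4), 1: F(3, 4)}               # on Y = {0, 1}

def f(x, y):
    """An arbitrary bounded function on X x Y."""
    return (x - y) ** 2 + x * y

# Integral against the product measure, then the two iterated integrals.
double_integral = sum(f(x, y) * mu[x] * nu[y] for x in mu for y in nu)
x_outer = sum(sum(f(x, y) * nu[y] for y in nu) * mu[x] for x in mu)
y_outer = sum(sum(f(x, y) * mu[x] for x in mu) * nu[y] for y in nu)
```

All three quantities coincide here because the sums are finite; the content of the theorem is that the same holds for general product measures under the stated integrability condition.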


The proof of Theorem 9.4.1 requires one technical lemma: 


Lemma 9.4.3. The mapping $E \mapsto \int_{\mathcal{Y}} \left(\int_{\mathcal{X}} 1_E(x,y)\, \mu(dx)\right) \nu(dy)$ is a
well-defined, countably additive function of subsets $E$.


Proof (optional). For $E \subseteq \mathcal{X} \times \mathcal{Y}$, and $y \in \mathcal{Y}$, let
$S_y(E) = \{x \in \mathcal{X};\ (x,y) \in E\}$. We first argue that, for any $y \in \mathcal{Y}$ and any set $E$ which
is measurable with respect to $\mu \times \nu$, the set $S_y(E)$ is measurable with
respect to $\mu$. Indeed, this is certainly true if $E = A \times B$ with $A$ and $B$
measurable, for then $S_y(E)$ is always either $A$ or $\emptyset$. On the other hand, since
$S_y$ preserves set operations (i.e., $S_y(E_1 \cup E_2) = S_y(E_1) \cup S_y(E_2)$, $S_y(E^C) = S_y(E)^C$,
etc.), the collection of sets $E$ for which $S_y(E)$ is measurable is
a $\sigma$-algebra. Furthermore, the measurable rectangles $A \times B$ generate the
product measure’s entire $\sigma$-algebra. Hence, for any measurable set $E$, $S_y(E)$
is a measurable subset of $\mathcal{X}$, so that $\mu(S_y(E))$ is well-defined.

Next, consider the collection $\mathcal{G}$ of those measurable subsets $E \subseteq \mathcal{X} \times \mathcal{Y}$
for which $\mu(S_y(E))$ is a measurable function of $y \in \mathcal{Y}$. This $\mathcal{G}$ certainly
contains all the measurable rectangles $A \times B$, since then $\mu(S_y(E)) = \mu(A)\, 1_B(y)$
which is measurable. Also, $\mathcal{G}$ is closed under complements since
$\mu(S_y(E^C)) = \mu(S_y(E)^C) = 1 - \mu(S_y(E))$. It is also closed under countable
disjoint unions, since if $\{E_n\}$ are disjoint, then so are $\{S_y(E_n)\}$, and thus
$\mu(S_y(\bigcup_n E_n)) = \mu(\bigcup_n S_y(E_n)) = \sum_n \mu(S_y(E_n))$. From these facts, it follows
(formally, from the “$\pi$-$\lambda$ theorem”, e.g. Billingsley, 1995, p. 42) that $\mathcal{G}$
includes all measurable subsets $E$, i.e. that $\mu(S_y(E))$ is a measurable function
of $y$ for any measurable set $E$. Thus, integrals like $\int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$
are well-defined. Since $\int_{\mathcal{Y}} \left(\int_{\mathcal{X}} 1_E(x,y)\, \mu(dx)\right) \nu(dy) = \int_{\mathcal{Y}} \mu(S_y(E))\, \nu(dy)$,
the claim about being well-defined follows.

To prove countable additivity, let $\{E_n\}$ be disjoint. Then, as above,
$\mu(S_y(\bigcup_n E_n)) = \sum_n \mu(S_y(E_n))$. Countable additivity then follows from
countable linearity of expectations with respect to $\nu$. ∎


Proof of Theorem 9.4.1. We first consider the case where $f = 1_E$ is an
indicator function of a measurable set $E \subseteq \mathcal{X} \times \mathcal{Y}$. By Lemma 9.4.3, the
mapping $E \mapsto \int_{\mathcal{Y}} \left(\int_{\mathcal{X}} 1_E(x,y)\, \mu(dx)\right) \nu(dy)$ is a well-defined, countably
additive function of subsets $E$. When $E = A \times B$, this integral clearly equals
$\mu(A)\nu(B) = (\mu \times \nu)(E)$. Hence, by uniqueness of extensions (Proposition 2.5.8),
we must have $\int_{\mathcal{Y}} \left(\int_{\mathcal{X}} 1_E(x,y)\, \mu(dx)\right) \nu(dy) = (\mu \times \nu)(E)$ for any
measurable subset $E$. Similarly, $\int_{\mathcal{X}} \left(\int_{\mathcal{Y}} 1_E(x,y)\, \nu(dy)\right) \mu(dx) = (\mu \times \nu)(E)$,
so that (9.4.2) holds whenever $f = 1_E$.

We complete the proof by our usual arguments. Indeed, by linearity,
(9.4.2) holds whenever $f$ is a simple function. Then, by the monotone
convergence theorem, (9.4.2) holds for general measurable non-negative
$f$. Finally, again by linearity since $f = f^+ - f^-$, we see that (9.4.2)
holds for general functions $f$ as long as we avoid the $\infty - \infty$ case where
$\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$ (while still allowing that the
inner integrals may be $\infty - \infty$ on a set of probability 0). ∎


Remark. If $\int_{\mathcal{X} \times \mathcal{Y}} f^+\, d(\mu \times \nu) = \int_{\mathcal{X} \times \mathcal{Y}} f^-\, d(\mu \times \nu) = \infty$, then Fubini’s
Theorem might not hold; see Exercises 9.5.13 and 9.5.14.


By additivity, Fubini’s Theorem still holds if $\mu$ and/or $\nu$ are $\sigma$-finite
measures (cf. Remark 4.4.3), rather than just probability measures. In
particular, letting $\mu = P$ and $\nu$ be counting measure on $\mathbf{N}$ leads to the
following generalisation of countable linearity (essentially a re-statement of
Exercise 4.5.14(a)):


Corollary 9.4.4. Let $Z_1, Z_2, \ldots$ be random variables with $\sum_i E|Z_i| < \infty$.
Then $E\left(\sum_i Z_i\right) = \sum_i E(Z_i)$.


As an application of Fubini’s Theorem, we consider the convolution
formula, about sums of independent random variables.


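Theorem 9.4.5. Suppose $X$ and $Y$ are independent random variables
with distributions $\mu = \mathcal{L}(X)$ and $\nu = \mathcal{L}(Y)$. Then the distribution of $X + Y$
is given by $\mu * \nu$, where

$$(\mu * \nu)(H) = \int_{\mathbf{R}} \mu(H - y)\, \nu(dy), \qquad H \subseteq \mathbf{R},$$

with $H - y = \{h - y;\ h \in H\}$. Furthermore, if $\mu$ has density $f$ and $\nu$ has
density $g$ (with respect to $\lambda$ = Lebesgue measure on $\mathbf{R}$), then $\mu * \nu$ has
density $f * g$, where

$$(f * g)(x) = \int_{\mathbf{R}} f(x - y)\, g(y)\, \lambda(dy), \qquad x \in \mathbf{R}.$$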


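Proof. Since $X$ and $Y$ are independent, we know that $\mathcal{L}((X,Y)) = \mu \times \nu$,
i.e. the distribution of the ordered pair $(X,Y)$ is equal to product measure.
Given a Borel subset $H \subseteq \mathbf{R}$, let $B = \{(x,y) \in \mathbf{R}^2;\ x + y \in H\}$. Then
using Fubini’s Theorem, we have

$$\begin{aligned}
P(X + Y \in H) &= P((X,Y) \in B) \\
&= (\mu \times \nu)(B) \\
&= \int_{\mathbf{R} \times \mathbf{R}} 1_B\, d(\mu \times \nu) \\
&= \int_{\mathbf{R}} \left(\int_{\mathbf{R}} 1_B(x,y)\, \mu(dx)\right) \nu(dy) \\
&= \int_{\mathbf{R}} \mu\{x \in \mathbf{R};\ (x,y) \in B\}\, \nu(dy) \\
&= \int_{\mathbf{R}} \mu(H - y)\, \nu(dy),
\end{aligned}$$

so $\mathcal{L}(X + Y) = \mu * \nu$. If $\mu$ has density $f$ and $\nu$ has density $g$, then using
Proposition 6.2.3, shift invariance of Lebesgue measure as in (1.2.5), and
Fubini’s theorem again,

$$\begin{aligned}
(\mu * \nu)(H) &= \int_{\mathbf{R}} \mu(H - y)\, \nu(dy) \\
&= \int_{\mathbf{R}} \left(\int_{H - y} 1\, \mu(dx)\right) \nu(dy) \\
&= \int_{\mathbf{R}} \left(\int_{H - y} f(x)\, \lambda(dx)\right) g(y)\, \lambda(dy) \\
&= \int_{\mathbf{R}} \left(\int_{H} f(x - y)\, \lambda(dx)\right) g(y)\, \lambda(dy) \\
&= \int_{H} \left(\int_{\mathbf{R}} f(x - y)\, g(y)\, \lambda(dy)\right) \lambda(dx) \\
&= \int_{H} (f * g)(x)\, \lambda(dx),
\end{aligned}$$

so $\mu * \nu$ has density given by $f * g$. ∎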
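
For discrete distributions the convolution formula becomes a finite sum, $(\mu * \nu)\{z\} = \sum_y \mu\{z - y\}\, \nu\{y\}$. The sketch below applies it to the sum of two fair dice, an illustrative example chosen here rather than one from the text, using exact fractions.

```python
from fractions import Fraction

def convolve(mu, nu):
    """Discrete convolution: (mu * nu){z} = sum over y of mu{z - y} * nu{y}.
    Both arguments are dicts mapping atoms to probabilities."""
    support = sorted({x + y for x in mu for y in nu})
    return {
        z: sum(mu.get(z - y, Fraction(0)) * q for y, q in nu.items())
        for z in support
    }

# Law of one fair die (illustrative choice of distribution).
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)       # law of the sum of two independent dice
three_dice = convolve(two_dice, die)
```

Here `mu.get(z - y, 0)` plays the role of $\mu(H - y)$ for the one-point set $H = \{z\}$, and repeated calls give the law of longer sums.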
9.5. Exercises.


Exercise 9.5.1. For the “simple counter-example” with $\Omega = \mathbf{N}$, $P(\omega) = 2^{-\omega}$
for $\omega \in \Omega$, and $X_n(\omega) = 2^n \delta_{\omega,n}$, verify explicitly that the hypotheses
of each of the Monotone Convergence Theorem, the Bounded Convergence 
Theorem, the Dominated Convergence Theorem, and the Uniform Integra- 
bility Convergence Theorem, are all violated. 


Exercise 9.5.2. Give an example of a sequence of random variables which 
is unbounded but still uniformly integrable. For bonus points, make the 
sequence also be undominated, i.e. violate the hypothesis of the Dominated 
Convergence Theorem. 


Exercise 9.5.3. Let $X, X_1, X_2, \ldots$ be non-negative random variables,
defined jointly on some probability triple $(\Omega, \mathcal{F}, P)$, each having finite
expected value. Assume that $\lim_{n \to \infty} X_n(\omega) = X(\omega)$ for all $\omega \in \Omega$. For
$n, K \in \mathbf{N}$, let $Y_{n,K} = \min(X_n, K)$. For each of the following statements,
either prove it must be true, or provide a counter-example to show it is
sometimes false.

(a) $\lim_{K \to \infty} \lim_{n \to \infty} E(Y_{n,K}) = E(X)$.

(b) $\lim_{n \to \infty} \lim_{K \to \infty} E(Y_{n,K}) = E(X)$.


Exercise 9.5.4. Suppose that $\lim_{n \to \infty} X_n(\omega) = 0$ for all $\omega \in \Omega$, but
$\lim_{n \to \infty} E[X_n] \ne 0$. Prove that $E(\sup_n |X_n|) = \infty$.


Exercise 9.5.5. Suppose $\sup_n E(|X_n|^r) < \infty$ for some $r > 1$. Prove that
$\{X_n\}$ is uniformly integrable. [Hint: If $|X_n(\omega)| \ge \alpha > 0$, then
$|X_n(\omega)| \le |X_n(\omega)|^r / \alpha^{r-1}$.]

Exercise 9.5.6. Prove that Theorem 9.1.6 implies Theorem 9.1.2. [Hint: 
Suppose $|X_n| \le Y$ where $E(Y) < \infty$. Prove that $\{X_n\}$ satisfies (9.1.4).]


Exercise 9.5.7. Prove that Theorem 9.1.2 implies Theorem 4.2.2, assuming
that $E|X| < \infty$. [Hint: Suppose $\{X_n\} \nearrow X$ where $E|X| < \infty$. Prove
that $\{X_n\}$ is dominated.]


Exercise 9.5.8. Let $\Omega = \{1, 2\}$, with $P(\{1\}) = P(\{2\}) = 1/2$, and let
$F_t(\{1\}) = t^2$ and $F_t(\{2\}) = t^4$ for $0 < t < 1$.

(a) What does Proposition 9.2.1 conclude in this case? 

(b) In light of the above, what rule from calculus is implied by Proposi- 
tion 9.2.1? 


Exercise 9.5.9. Let $X_1, X_2, \ldots$ be i.i.d., each with $P(X_i = 1) = P(X_i = -1) = 1/2$.
(a) Compute the moment generating functions $M_{X_i}(s)$.
(b) Use Theorem 9.3.4 to obtain an exponentially-decreasing upper bound
on $P\left(\frac{1}{n}(X_1 + \ldots + X_n) \ge 0.1\right)$.


Exercise 9.5.10. Let $X_1, X_2, \ldots$ be i.i.d., each having the standard
normal distribution $N(0,1)$. Use Theorem 9.3.4 to obtain an exponentially-decreasing
upper bound on $P\left(\frac{1}{n}(X_1 + \ldots + X_n) \ge 0.1\right)$. [Hint: Don’t forget
(9.3.2).]


Exercise 9.5.11. Let $X$ have the distribution Exponential(5), with
density $f_X(x) = 5e^{-5x}$ for $x \ge 0$ (with $f_X(x) = 0$ for $x < 0$).

(a) Compute the moment generating function $M_X(t)$.

(b) Use $M_X(t)$ to compute (with explanation) the expected value $E(X)$.


Exercise 9.5.12. Let $\alpha > 2$, and let $M(t) = e^{-|t|^\alpha}$ for $t \in \mathbf{R}$. Prove that
$M(t)$ is not a characteristic function of any probability distribution. [Hint:
Consider $M''(t)$.]


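Exercise 9.5.13. Let $\mathcal{X} = \mathcal{Y} = \mathbf{N}$, and let $\mu\{n\} = \nu\{n\} = 2^{-n}$ for $n \in \mathbf{N}$.
Let $f : \mathcal{X} \times \mathcal{Y} \to \mathbf{R}$ by $f(n,n) = (4^n - 1)$, and $f(n, n+1) = -2(4^n - 1)$,
with $f(n,m) = 0$ otherwise.

(a) Compute $\int_{\mathcal{X}} \left(\int_{\mathcal{Y}} f(x,y)\, \nu(dy)\right) \mu(dx)$.

(b) Compute $\int_{\mathcal{Y}} \left(\int_{\mathcal{X}} f(x,y)\, \mu(dx)\right) \nu(dy)$.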
(c) Why does the result not contradict Fubini’s Theorem? 


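Exercise 9.5.14. Let $\lambda$ be Lebesgue measure on $[0,1]$, and let
$f(x,y) = 8xy(x^2 - y^2)(x^2 + y^2)^{-3}$ for $(x,y) \ne (0,0)$, with $f(0,0) = 0$.

(a) Compute $\int_0^1 \left(\int_0^1 f(x,y)\, \lambda(dy)\right) \lambda(dx)$. [Hint: Make the substitution
$u = x^2 + y^2$, $v = x$, so $du = 2y\, dy$, $dv = dx$, and $x^2 - y^2 = 2v^2 - u$.]

(b) Compute $\int_0^1 \left(\int_0^1 f(x,y)\, \lambda(dx)\right) \lambda(dy)$.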

(c) Why does the result not contradict Fubini’s Theorem? 


Exercise 9.5.15. Let $X \sim$ Poisson($a$) and $Y \sim$ Poisson($b$) be independent.
Let $Z = X + Y$. Use the convolution formula to compute $P(Z = z)$
for all $z \in \mathbf{R}$, and prove that $Z \sim$ Poisson($a + b$).


Exercise 9.5.16. Let $X \sim N(a, v)$ and $Y \sim N(b, w)$ be independent. Let
$Z = X + Y$. Use the convolution formula to prove that $Z \sim N(a + b, v + w)$.


Exercise 9.5.17. For $\alpha, \beta > 0$, the Gamma($\alpha, \beta$) distribution has
density function $f(x) = \beta^\alpha x^{\alpha - 1} e^{-\beta x} / \Gamma(\alpha)$ for $x > 0$ (with $f(x) = 0$ for
$x \le 0$), where $\Gamma(\alpha) = \int_0^\infty t^{\alpha - 1} e^{-t}\, dt$ is the gamma function. (Hence, when
$\alpha = 1$, Gamma($1, \beta$) = Exp($\beta$).) Let $X \sim$ Gamma($\alpha, \beta$) and $Y \sim$
Gamma($\gamma, \beta$) be independent, and let $Z = X + Y$. Use the convolution
formula to prove that $Z \sim$ Gamma($\alpha + \gamma, \beta$). [Note: You may use the
facts that $\Gamma(\alpha + 1) = \alpha \Gamma(\alpha)$ for $\alpha \in \mathbf{R}$, and $\Gamma(n) = (n-1)!$ for $n \in \mathbf{N}$, and
$\int_0^x t^{r-1} (x - t)^{s-1}\, dt = x^{r+s-1}\, \Gamma(r)\, \Gamma(s) / \Gamma(r + s)$ for $r, s, x > 0$.]


9.6. Section summary. 


This section presented various probability results that will be required 
for the more advanced portions of the text. First, the Dominated Convergence
Theorem and the Uniform Integrability Convergence Theorem were proved,
to extend the Monotone Convergence Theorem and the Bounded Convergence
Theorem studied previously, providing further conditions under which
$\lim E(X_n) = E(\lim X_n)$. Second, a result was given allowing derivative and
expectation operators to be exchanged. Third, moment generating functions 
were introduced, and some of their basic properties were studied. Finally, 
Fubini’s Theorem for iterated integration was proved, and applied to give 
a convolution formula for the distribution of a sum of independent random 
variables.


10. Weak convergence.


Given Borel probability distributions $\mu, \mu_1, \mu_2, \ldots$ on $\mathbf{R}$, we shall write
$\mu_n \Rightarrow \mu$, and say that $\{\mu_n\}$ converges weakly to $\mu$, if $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$
for all bounded continuous functions $f : \mathbf{R} \to \mathbf{R}$.

This is a rather natural* definition, though we draw the reader’s attention
to the fact that this convergence need hold only for continuous functions
$f$ (as opposed to all Borel-measurable $f$; cf. Proposition 3.1.8). That is, the
“topology” of $\mathbf{R}$ is being used here, not just its measure-theoretic properties.
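
A concrete instance of the definition: let $\mu_n$ put mass $1/n$ on each of the points $i/n$, $i = 1, \ldots, n$, and let $\mu$ be Lebesgue measure on $[0,1]$. The sketch below (with the test function $f = \cos$ chosen here purely for illustration) evaluates $\int f\, d\mu_n$, which is just a Riemann sum, and compares it with $\int_0^1 \cos(x)\, dx = \sin(1)$.

```python
import math

def integral_against_mu_n(f, n):
    """Integral of f with respect to mu_n, where mu_n puts mass 1/n
    on each of the points i/n for i = 1, ..., n."""
    return sum(f(i / n) for i in range(1, n + 1)) / n

# Integral of cos over [0, 1] with respect to Lebesgue measure.
limit_value = math.sin(1.0)

sizes = (10, 100, 1000, 10000)
errors = [abs(integral_against_mu_n(math.cos, n) - limit_value) for n in sizes]
```

The errors shrink as $n$ grows, as weak convergence requires for this bounded continuous $f$. Checking a single test function does not establish weak convergence; it only illustrates what the definition asks for.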


10.1. Equivalences of weak convergence. 


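We now present a number of equivalences of weak convergence (see also
Exercise 10.3.8). For condition (2), recall that the boundary of a set $A \subseteq \mathbf{R}$
is $\partial A = \{x \in \mathbf{R};\ \forall \epsilon > 0,\ A \cap (x - \epsilon, x + \epsilon) \ne \emptyset,\ A^C \cap (x - \epsilon, x + \epsilon) \ne \emptyset\}$.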


Theorem 10.1.1. The following are equivalent.

(1) $\mu_n \Rightarrow \mu$ (i.e., $\{\mu_n\}$ converges weakly to $\mu$);

(2) $\mu_n(A) \to \mu(A)$ for all measurable sets $A$ such that $\mu(\partial A) = 0$;

(3) $\mu_n((-\infty, x]) \to \mu((-\infty, x])$ for all $x \in \mathbf{R}$ such that $\mu\{x\} = 0$;

(4) (Skorohod’s Theorem) there are random variables $Y, Y_1, Y_2, \ldots$ defined
jointly on some probability triple, with $\mathcal{L}(Y) = \mu$ and $\mathcal{L}(Y_n) = \mu_n$ for each
$n \in \mathbf{N}$, such that $Y_n \to Y$ with probability 1.

(5) $\int_{\mathbf{R}} f\, d\mu_n \to \int_{\mathbf{R}} f\, d\mu$ for all bounded Borel-measurable functions
$f : \mathbf{R} \to \mathbf{R}$ such that $\mu(D_f) = 0$, where $D_f$ is the set of points where $f$ is
discontinuous.


Proof. (5) $\Longrightarrow$ (1): Immediate.

(5) $\Longrightarrow$ (2): This follows by setting $f = 1_A$, so that $D_f = \partial A$, and
$\mu(D_f) = \mu(\partial A) = 0$. Then $\mu_n(A) = \int f\, d\mu_n \to \int f\, d\mu = \mu(A)$.

(2) $\Longrightarrow$ (3): Immediate, since the boundary of $(-\infty, x]$ is $\{x\}$.

(1) $\Longrightarrow$ (3): Let $\epsilon > 0$, and let $f$ be the function defined by $f(t) = 1$ for
$t \le x$, $f(t) = 0$ for $t \ge x + \epsilon$, with $f$ linear on the interval $(x, x + \epsilon)$ (see
Figure 10.1.2 (a)). Then $f$ is continuous, with $1_{(-\infty, x]} \le f \le 1_{(-\infty, x + \epsilon]}$.
Hence,

$$\limsup_{n \to \infty} \mu_n((-\infty, x]) \le \limsup_{n \to \infty} \int f\, d\mu_n = \int f\, d\mu \le \mu((-\infty, x + \epsilon]).$$

This is true for any $\epsilon > 0$, so we conclude that $\limsup_{n \to \infty} \mu_n((-\infty, x]) \le \mu((-\infty, x])$.

* In fact, it corresponds to the weak* (“weak-star”) topology from functional analysis,
with $\mathcal{X}$ the set of all continuous functions on $\mathbf{R}$ vanishing at infinity (cf. Exercise 10.3.8),
with norm defined by $\|f\| = \sup_{x \in \mathbf{R}} |f(x)|$, and with dual space $\mathcal{X}^*$ consisting of all
finite signed Borel measures on $\mathbf{R}$. The Helly Selection Principle below then follows from
Alaoglu’s Theorem. See e.g. pages 161-2, 205, and 216 of Folland (1984).


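[Figure 10.1.2. Functions used in proof of Theorem 10.1.1. Panel (a): $f(t)$, equal to 1 for $t \le x$ and decreasing linearly to 0 on $(x, x + \epsilon)$. Panel (b): $g(t)$, equal to 1 for $t \le x - \epsilon$ and decreasing linearly to 0 on $(x - \epsilon, x)$.]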


Similarly, if we let $g$ be the function defined by $g(t) = 1$ for $t \le x - \epsilon$,
$g(t) = 0$ for $t \ge x$, with $g$ linear on the interval $(x - \epsilon, x)$ (see Figure 10.1.2
(b)), then $1_{(-\infty, x - \epsilon]} \le g \le 1_{(-\infty, x]}$, and we obtain that

$$\liminf_{n \to \infty} \mu_n((-\infty, x]) \ge \liminf_{n \to \infty} \int g\, d\mu_n = \int g\, d\mu \ge \mu((-\infty, x - \epsilon]).$$

This is true for any $\epsilon > 0$, so we conclude that $\liminf_{n \to \infty} \mu_n((-\infty, x]) \ge \mu((-\infty, x))$.

But if $\mu\{x\} = 0$, then $\mu((-\infty, x]) = \mu((-\infty, x))$, so we must have

$$\limsup_{n \to \infty} \mu_n((-\infty, x]) = \liminf_{n \to \infty} \mu_n((-\infty, x]) = \mu((-\infty, x]),$$

as claimed.

(3) $\Longrightarrow$ (4): We first define the cumulative distribution functions, by
$F_n(x) = \mu_n((-\infty, x])$ and $F(x) = \mu((-\infty, x])$. Then, if we let $(\Omega, \mathcal{F}, P)$
be Lebesgue measure on $[0,1]$, and let $Y_n(\omega) = \inf\{x;\ F_n(x) \ge \omega\}$ and
$Y(\omega) = \inf\{x;\ F(x) \ge \omega\}$, then as in Lemma 7.1.2 we have $\mathcal{L}(Y_n) = \mu_n$ and
$\mathcal{L}(Y) = \mu$. Note that if $F(z) < a$, then $Y(a) \ge z$, while if $F(w) \ge b$, then
$Y(b) \le w$.

Since {Fa} — F at most points, it seems reasonable that {Y,} — Y at 
most points. We will prove that {Y,}— Y at points of continuity of Y. 
Then, since Y is non-decreasing, it can have at most a countable number 
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of discontinuities: indeed, it has at most m(Y (n + 1) — Y (n)) < œœ discon- 
tinuities of size > 1/m within the interval (n,n + 1], then take countable 
union over m and n. Since countable sets have Lebesgue measure 0, this 
implies that {Y,}— Y with probability 1, proving (4). 

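Suppose, then, that $Y$ is continuous at $\omega$, and let $y = Y(\omega)$. For any
$\epsilon > 0$, we claim that $F(y-\epsilon) < \omega < F(y+\epsilon)$. Indeed, if we had $F(y-\epsilon) \ge \omega$,
then setting $w = y - \epsilon$ and $b = \omega$ above, this would imply $Y(\omega) \le y - \epsilon =
Y(\omega) - \epsilon$, a contradiction. Or, if we had $F(y+\epsilon) \le \omega$, then setting $z = y + \epsilon$
and $a = \omega + \delta$ above, this would imply $Y(\omega + \delta) \ge y + \epsilon = Y(\omega) + \epsilon$ for all
$\delta > 0$, contradicting the continuity of $Y$ at $\omega$. So, $F(y-\epsilon) < \omega < F(y+\epsilon)$
for all $\epsilon > 0$.

Next, given $\epsilon > 0$, find $\epsilon'$ with $0 < \epsilon' < \epsilon$ such that $\mu\{y - \epsilon'\} =
\mu\{y + \epsilon'\} = 0$. Then $F_n(y - \epsilon') \to F(y - \epsilon')$ and $F_n(y + \epsilon') \to F(y + \epsilon')$, so
$F_n(y - \epsilon') < \omega < F_n(y + \epsilon')$ for all sufficiently large $n$. This in turn implies
(setting first $z = y - \epsilon'$ and $a = \omega$ above, and then $w = y + \epsilon'$ and $b = \omega$
above) that $y - \epsilon' \le Y_n(\omega) \le y + \epsilon'$, i.e. $|Y_n(\omega) - Y(\omega)| \le \epsilon' < \epsilon$ for all
sufficiently large $n$. Hence, $Y_n(\omega) \to Y(\omega)$.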

(4) $\Rightarrow$ (5): Recall that if $f$ is continuous at $x$, and if $\{x_n\} \to x$, then
$f(x_n) \to f(x)$. Hence, if $\{Y_n\} \to Y$ and $Y \notin D_f$, then $\{f(Y_n)\} \to f(Y)$.
It follows that $P[\{f(Y_n)\} \to f(Y)] \ge P[\{Y_n\} \to Y \text{ and } Y \notin D_f]$. But
by assumption, $P[\{Y_n\} \to Y] = 1$ and $P[Y \notin D_f] = \mu(D_f^C) = 1$, so also
$P[\{f(Y_n)\} \to f(Y)] = 1$. If $f$ is bounded, then from the bounded convergence
theorem, $E[f(Y_n)] \to E[f(Y)]$, i.e. $\int f \, d\mu_n \to \int f \, d\mu$, as claimed. ∎


For a first example, let $\mu$ be Lebesgue measure on $[0,1]$, and let $\mu_n$ be
defined by $\mu_n\{i/n\} = 1/n$ for $i = 1, 2, \ldots, n$. Then $\mu$ is purely continuous while
$\mu_n$ is purely discrete; furthermore, $\mu(\mathbf{Q}) = 0$ while $\mu_n(\mathbf{Q}) = 1$ for each $n$.
On the other hand, for any $0 < x < 1$, we have $\mu((-\infty,x]) = x$ while
$\mu_n((-\infty,x]) = \lfloor nx \rfloor / n$. Hence, $|\mu_n((-\infty,x]) - \mu((-\infty,x])| \le 1/n \to 0$ as
$n \to \infty$, so we do indeed have $\mu_n \Rightarrow \mu$. (Note that $\partial\mathbf{Q} = [0,1]$ so that
$\mu(\partial\mathbf{Q}) \ne 0$.)
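
The bound in this first example is easy to check numerically. The following is a minimal sketch, not part of the original text: the function names and the grid size are our own choices, and it only evaluates the two distribution functions on a finite grid of points in $(0,1)$.

```python
import math

def cdf_mu(x):
    """CDF of Lebesgue measure on [0, 1]."""
    return min(max(x, 0.0), 1.0)

def cdf_mu_n(x, n):
    """CDF of the measure putting mass 1/n on each point i/n, i = 1, ..., n."""
    return min(max(math.floor(n * x), 0), n) / n

def max_cdf_gap(n, grid=1000):
    """Largest |F_n(x) - F(x)| over the grid points k/grid, 0 < k < grid."""
    return max(abs(cdf_mu_n(k / grid, n) - cdf_mu(k / grid))
               for k in range(1, grid))

if __name__ == "__main__":
    for n in (1, 7, 50, 1000):
        print(n, max_cdf_gap(n), 1 / n)
```

For each n the printed gap should not exceed 1/n, in line with the estimate above.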

For a second example, suppose $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$,
and $S_n = \frac{1}{n}(X_1 + \ldots + X_n)$. Then the weak law of large numbers says
that for any $\epsilon > 0$ we have $P(S_n \le m - \epsilon) \to 0$ and $P(S_n \le m + \epsilon) \to 1$ as
$n \to \infty$. It follows that $\mathcal{L}(S_n) \Rightarrow \delta_m(\cdot)$, a point mass at $m$. Note that it is
not necessarily the case that $P(S_n \le m) \to \delta_m((-\infty, m]) = 1$, but this is
no contradiction since the boundary of $(-\infty, m]$ is $\{m\}$, and $\delta_m\{m\} \ne 0$.


10.2. Connections to other convergence. 
We now explore a sufficient condition for weak convergence. 


Proposition 10.2.1. If $\{X_n\} \to X$ in probability, then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.

Proof. For any $\epsilon > 0$, if $X > z + \epsilon$ and $|X_n - X| < \epsilon$, then we must
have $X_n > z$. That is, $\{X > z + \epsilon\} \cap \{|X_n - X| < \epsilon\} \subseteq \{X_n > z\}$.
Taking complements, $\{X \le z + \epsilon\} \cup \{|X_n - X| \ge \epsilon\} \supseteq \{X_n \le z\}$.
Hence, by the order-preserving property and subadditivity, $P(X_n \le z) \le
P(X \le z + \epsilon) + P(|X - X_n| \ge \epsilon)$. Since $\{X_n\} \to X$ in probability, we
get that $\limsup_{n\to\infty} P(X_n \le z) \le P(X \le z + \epsilon)$. Letting $\epsilon \searrow 0$ gives
$\limsup_{n\to\infty} P(X_n \le z) \le P(X \le z)$.

Similarly, interchanging $X$ and $X_n$ and replacing $z$ with $z - \epsilon$ in the
above gives $P(X \le z - \epsilon) \le P(X_n \le z) + P(|X - X_n| \ge \epsilon)$, or $P(X_n \le
z) \ge P(X \le z - \epsilon) - P(|X - X_n| \ge \epsilon)$, so $\liminf P(X_n \le z) \ge P(X \le z - \epsilon)$.
Letting $\epsilon \searrow 0$ gives $\liminf P(X_n \le z) \ge P(X < z)$.

If $P(X = z) = 0$, then $P(X < z) = P(X \le z)$, so we must have
$\liminf P(X_n \le z) = \limsup P(X_n \le z) = P(X \le z)$, as claimed. ∎


Remark. We sometimes write $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ simply as $X_n \Rightarrow X$, and
say that $\{X_n\}$ converges weakly (or, in distribution) to $X$.


We now have an interesting near-circle of implications. We already knew
(Proposition 5.2.3) that if $X_n \to X$ almost surely, then $X_n \to X$ in probability.
We now see from Proposition 10.2.1 that this in turn implies that
$\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$. And from Theorem 10.1.1(4), this implies that there are
random variables $Y_n$ and $Y$ having the same laws, such that $Y_n \to Y$ almost
surely.

Note that the converse to Proposition 10.2.1 is clearly false, since the
fact that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ says nothing about the underlying relationship
between $X_n$ and $X$; it only says something about their laws. For example,
if $X, X_1, X_2, \ldots$ are i.i.d., each equal to $\pm 1$ with probability $\frac{1}{2}$, then of course
$\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, but on the other hand $P(|X - X_n| \ge 2) = \frac{1}{2} \not\to 0$, so $X_n$
does not converge to $X$ in probability or with probability 1. However, if $X$ is
constant then the converse to Proposition 10.2.1 does hold (Exercise 10.3.1).

Finally, we note that Skorohod’s Theorem may be used to translate 
results involving convergence with probability 1 to results involving weak 
convergence (or, by Proposition 10.2.1, convergence in probability). For 
example, we have 


Proposition 10.2.2. Suppose $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, with $X_n \ge 0$. Then
$E(X) \le \liminf E(X_n)$.


Proof. By Skorohod's Theorem, we can find random variables $Y_n$ and
$Y$ with $\mathcal{L}(Y_n) = \mathcal{L}(X_n)$, $\mathcal{L}(Y) = \mathcal{L}(X)$, and $Y_n \to Y$ with probability 1.
Then, from Fatou's Lemma,

$$E(X) = E(Y) = E(\liminf Y_n) \le \liminf E(Y_n) = \liminf E(X_n). \quad ∎$$

For example, if $X \equiv 0$, and if $P(X_n = n) = \frac{1}{n}$ and $P(X_n = 0) = 1 - \frac{1}{n}$,
then $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$, and $0 = E(X) \le \liminf E(X_n) = 1$. (In fact, here
$X_n \to X$ in probability, as well.)
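
The strict inequality in this example can be verified with exact arithmetic. The sketch below is illustrative only; the helper names are our own and not from the text. It represents the law of $X_n$ as a finite table and computes $E(X_n)$ and $P(|X_n| \ge \epsilon)$ with rational numbers.

```python
from fractions import Fraction

def law_of_X_n(n):
    """Law of X_n as a dict {value: probability}: n w.p. 1/n, 0 otherwise."""
    return {n: Fraction(1, n), 0: 1 - Fraction(1, n)}

def mean(law):
    """Expected value of a finitely supported law."""
    return sum(value * prob for value, prob in law.items())

def prob_at_least(law, eps):
    """P(|X_n - 0| >= eps), the quantity controlling convergence in probability."""
    return sum(prob for value, prob in law.items() if abs(value) >= eps)

if __name__ == "__main__":
    for n in (2, 10, 1000):
        law = law_of_X_n(n)
        print(n, mean(law), prob_at_least(law, Fraction(1, 2)))
```

The mean stays equal to 1 for every n, while the probability of being away from 0 is 1/n and so tends to 0.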


Remark 10.2.3. We note that most of these weak convergence concepts 
have direct analogues for higher-dimensional distributions, not considered 
here; see e.g. Billingsley (1995, Section 29). 


10.3. Exercises. 


Exercise 10.3.1. Suppose $\mathcal{L}(X_n) \Rightarrow \delta_c$ for some $c \in \mathbf{R}$. Prove that $\{X_n\}$
converges to $c$ in probability.


Exercise 10.3.2. Let $X, Y_1, Y_2, \ldots$ be independent random variables,
with $P(Y_n = 1) = 1/n$ and $P(Y_n = 0) = 1 - 1/n$. Let $Z_n = X + Y_n$. Prove
that $\mathcal{L}(Z_n) \Rightarrow \mathcal{L}(X)$, i.e. that the law of $Z_n$ converges weakly to the law of
$X$.


Exercise 10.3.3. Let $\mu_n = N(0, \frac{1}{n})$ be a normal distribution with
mean 0 and variance $\frac{1}{n}$. Does the sequence $\{\mu_n\}$ converge weakly to some
probability measure? If yes, to what measure?


Exercise 10.3.4. Prove that weak limits, if they exist, are unique. That
is, if $\mu, \nu, \mu_1, \mu_2, \ldots$ are probability measures, and $\mu_n \Rightarrow \mu$, and also $\mu_n \Rightarrow \nu$,
then $\mu = \nu$.


Exercise 10.3.5. Let $\mu_n$ be the Poisson($n$) distribution, and let $\mu$ be the
Poisson(5) distribution. Show explicitly that each of the four conditions
of Theorem 10.1.1 are violated.


Exercise 10.3.6. Let $a_1, a_2, \ldots$ be any sequence of non-negative real numbers
with $\sum_i a_i = 1$. Define the discrete measure $\mu$ by $\mu(\cdot) = \sum_{i \in \mathbf{N}} a_i \delta_i(\cdot)$,
where $\delta_i(\cdot)$ is a point-mass at the positive integer $i$. Construct a sequence
$\{\mu_n\}$ of probability measures, each having a density with respect to Lebesgue
measure, such that $\mu_n \Rightarrow \mu$.


Exercise 10.3.7. Let $\mathcal{L}(Y) = \mu$, where $\mu$ has continuous density $f$. For
$n \in \mathbf{N}$, let $Y_n = \lfloor nY \rfloor / n$, and let $\mu_n = \mathcal{L}(Y_n)$.

(a) Describe $\mu_n$ explicitly.

(b) Prove that $\mu_n \Rightarrow \mu$.

(c) Is $\mu_n$ discrete, or absolutely continuous, or neither? What about $\mu$?


Exercise 10.3.8. Prove that the following are equivalent.

(1) $\mu_n \Rightarrow \mu$.

(2) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative bounded continuous $f : \mathbf{R} \to \mathbf{R}$.

(3) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f : \mathbf{R} \to \mathbf{R}$ with
compact support, i.e. such that there are finite $a$ and $b$ with $f(x) = 0$ for all
$x < a$ and all $x > b$.

(4) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f : \mathbf{R} \to \mathbf{R}$ with compact support.

(5) $\int f \, d\mu_n \to \int f \, d\mu$ for all non-negative continuous $f : \mathbf{R} \to \mathbf{R}$ which
vanish at infinity, i.e. such that $\lim_{x \to -\infty} f(x) = \lim_{x \to \infty} f(x) = 0$.

(6) $\int f \, d\mu_n \to \int f \, d\mu$ for all continuous $f : \mathbf{R} \to \mathbf{R}$ which vanish at infinity.

(Hints: You may assume the fact that all continuous functions on $\mathbf{R}$ which
have compact support or vanish at infinity are bounded. Then, showing
that (1) $\Rightarrow$ each of (4)-(6), and that each of (4)-(6) $\Rightarrow$ (3), is easy. For
(2) $\Rightarrow$ (1), note that if $|f| \le M$, then $f + M$ is non-negative. For (3)
$\Rightarrow$ (2), note that if $f$ is non-negative bounded continuous and $m \in \mathbf{Z}$,
then $f_m = f \, \mathbf{1}_{[m,m+1)}$ is non-negative bounded with compact support and
is "nearly" continuous; then recall Figure 10.1.2, and that $f = \sum_{m \in \mathbf{Z}} f_m$.)


Exercise 10.3.9. Let $0 < M < \infty$, and let $f, f_1, f_2, \ldots : [0,1] \to
[0, M]$ be Borel-measurable functions with $\int_0^1 f \, d\lambda = \int_0^1 f_n \, d\lambda = 1$. Suppose
$\lim_n f_n(x) = f(x)$ for each fixed $x \in [0,1]$. Define probability measures
$\mu, \mu_1, \mu_2, \ldots$ by $\mu(A) = \int_A f \, d\lambda$ and $\mu_n(A) = \int_A f_n \, d\lambda$, for Borel $A \subseteq [0,1]$.
Prove that $\mu_n \Rightarrow \mu$.


Exercise 10.3.10. Let $f : [0,1] \to (0,\infty)$ be a continuous function
such that $\int_0^1 f \, d\lambda = 1$ (where $\lambda$ is Lebesgue measure on $[0,1]$). Define
probability measures $\mu$ and $\{\mu_n\}$ by $\mu(A) = \int_0^1 f \, \mathbf{1}_A \, d\lambda$ and
$\mu_n(A) = \sum_{i=1}^n f(i/n) \, \mathbf{1}_A(i/n) \,/\, \sum_{i=1}^n f(i/n)$.

(a) Prove that $\mu_n \Rightarrow \mu$. [Hint: Recall Riemann sums from calculus.]

(b) Explicitly construct random variables $Y$ and $\{Y_n\}$ so that $\mathcal{L}(Y) = \mu$,
$\mathcal{L}(Y_n) = \mu_n$, and $Y_n \to Y$ with probability 1. [Hint: Remember the proof
of Theorem 10.1.1.]


10.4. Section summary. 


This section introduced the notion of weak convergence. It proved equiv- 
alences of weak convergence in terms of convergence of expectations of 
bounded continuous functions, convergence of probabilities, convergence of
cumulative distribution functions, and the existence of corresponding ran-
dom variables which converge with probability 1. It proved that if random
variables converge in probability (or with probability 1), then their laws 
converge weakly. 

Weak convergence will be very important in the next section, including
allowing for a precise statement and proof of the Central Limit Theorem
(Theorem 11.2.2 on page 134).


11. Characteristic functions.


Given a random variable X, we define its characteristic function (or 
Fourier transform) by 


$$\phi_X(t) = E(e^{itX}) = E[\cos(tX)] + i E[\sin(tX)], \qquad t \in \mathbf{R}.$$


The characteristic function is thus a function from the real numbers to the
complex numbers. Of course, by the Change of Variable Theorem (Theorem 6.1.1),
$\phi_X(t)$ depends only on the distribution of $X$. We sometimes
write $\phi_X(t)$ as $\phi(t)$.

The characteristic function is clearly very similar to the moment generating
function introduced earlier; the only difference is the appearance of
the imaginary number $i = \sqrt{-1}$ in the exponent. However, this change is
significant; since $|e^{itX}| = 1$ for any (real) $t$ and $X$, the triangle inequality
implies that $|\phi_X(t)| \le 1 < \infty$ for all $t$ and all random variables $X$. This is
quite a contrast to the case for moment generating functions, which could
be infinite for any $s \ne 0$.

As for moment generating functions, we have $\phi_X(0) = 1$ for any $X$,
and if $X$ and $Y$ are independent then $\phi_{X+Y}(t) = \phi_X(t) \, \phi_Y(t)$ by (4.2.7).
We further note that, with $\mu = \mathcal{L}(X)$, we have

$$|\phi_X(t+h) - \phi_X(t)| = \left| \int \left( e^{i(t+h)x} - e^{itx} \right) \mu(dx) \right|$$

$$\le \int \left| e^{i(t+h)x} - e^{itx} \right| \mu(dx) \le \int |e^{itx}| \, |e^{ihx} - 1| \, \mu(dx)$$

$$= \int |e^{ihx} - 1| \, \mu(dx).$$


Now, as $h \to 0$, this last quantity decreases to 0 by the bounded convergence
theorem (since $|e^{ihx} - 1| \le 2$). Since the bound does not depend on $t$, we
conclude that $\phi_X$ is always a (uniformly) continuous function.

The derivatives of $\phi_X$ are also straightforward. The following proposition
is somewhat similar to the corresponding result for $M_X(s)$ (Theorem 9.3.3),
except that here we do not require a severe condition like
"$M_X(s) < \infty$ for all $|s| < s_0$".


Proposition 11.0.1. Suppose $X$ is a random variable with $E(|X|^k) <
\infty$. Then for $0 \le j \le k$, $\phi_X$ has finite $j$-th derivative, given by $\phi_X^{(j)}(t) =
E[(iX)^j e^{itX}]$. In particular, $\phi_X^{(j)}(0) = i^j E(X^j)$.


Proof. We proceed by induction on $j$. The case $j = 0$ is trivial. Assume
now that the statement is true for $j - 1$. For $t \in \mathbf{R}$, let $F_t = (iX)^{j-1} e^{itX}$,
so that $|F_t'| = |(iX)^j e^{itX}| = |X|^j$. Since $E(|X|^k) < \infty$, therefore also
$E(|X|^j) < \infty$. It thus follows from Proposition 9.2.1 that

$$\phi_X^{(j)}(t) = \frac{d}{dt} \phi_X^{(j-1)}(t) = \frac{d}{dt} E[(iX)^{j-1} e^{itX}]$$

$$= E\left[ \frac{\partial}{\partial t} (iX)^{j-1} e^{itX} \right] = E[(iX)^j e^{itX}]. \quad ∎$$
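
The identity $\phi_X^{(j)}(0) = i^j E(X^j)$ can be illustrated numerically on a finitely supported distribution. This is a sketch under our own assumptions: the support points and probabilities below are hypothetical, and the derivatives are approximated by central finite differences rather than computed exactly.

```python
import cmath

# Hypothetical finitely supported law, chosen only for illustration.
VALUES = [-1.0, 0.0, 2.0]
PROBS = [0.2, 0.5, 0.3]

def phi(t):
    """Characteristic function E[exp(itX)] of the law above."""
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(VALUES, PROBS))

def moment(j):
    """j-th moment E[X^j] of the law above."""
    return sum(p * x ** j for x, p in zip(VALUES, PROBS))

def first_derivative_at_zero(h=1e-5):
    """Central-difference approximation to phi'(0)."""
    return (phi(h) - phi(-h)) / (2 * h)

def second_derivative_at_zero(h=1e-4):
    """Central-difference approximation to phi''(0)."""
    return (phi(h) - 2 * phi(0.0) + phi(-h)) / h ** 2

if __name__ == "__main__":
    print(first_derivative_at_zero(), 1j * moment(1))
    print(second_derivative_at_zero(), (1j ** 2) * moment(2))
```

The two printed pairs should agree up to the finite-difference error.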


11.1. The continuity theorem. 


In this subsection we shall prove the continuity theorem for characteris- 
tic functions (Theorem 11.1.14), which says that if characteristic functions 
converge pointwise, then the corresponding distributions converge weakly: 
$\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t$. This is a very important
result; for example, it is used to prove the central limit theorem in the 
next subsection. Unfortunately, the proof is somewhat technical; we must 
show that characteristic functions completely determine the correspond- 
ing distribution (Theorem 11.1.1 and Corollary 11.1.7 below), and must 
also establish a simple criterion for weak convergence of “tight” measures 
(Corollary 11.1.11). 

We begin with an inversion theorem, which tells how to recover infor- 
mation about a probability distribution from its characteristic function. 


Theorem 11.1.1. (Fourier inversion theorem) Let $\mu$ be a Borel probability
measure on $\mathbf{R}$, with characteristic function $\phi(t) = \int_{\mathbf{R}} e^{itx} \mu(dx)$. Then
if $a < b$ and $\mu\{a\} = \mu\{b\} = 0$, then

$$\mu([a,b]) = \lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt.$$

To prove Theorem 11.1.1, we use two computational lemmas.


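Lemma 11.1.2. For $T > 0$ and $a < b$,

$$\int_{\mathbf{R}} \int_{-T}^{T} \left| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \right| dt \, \mu(dx) \le 2T(b-a) < \infty.$$


Proof. We first note by the triangle inequality that

$$\left| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \right| = \left| \frac{e^{-ita} - e^{-itb}}{it} \right| = \left| \int_a^b e^{-itr} \, dr \right| \le \int_a^b |e^{-itr}| \, dr = \int_a^b 1 \, dr = b - a.$$

Hence,

$$\int_{\mathbf{R}} \int_{-T}^{T} \left| \frac{e^{-ita} - e^{-itb}}{it} \, e^{itx} \right| dt \, \mu(dx) \le \int_{\mathbf{R}} \int_{-T}^{T} (b-a) \, dt \, \mu(dx)$$

$$= \int_{\mathbf{R}} 2T(b-a) \, \mu(dx) = 2T(b-a). \quad ∎$$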

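Lemma 11.1.3. For $T > 0$ and $\theta \in \mathbf{R}$,

$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \pi \, \mathrm{sign}(\theta), \qquad (11.1.4)$$

where $\mathrm{sign}(\theta) = 1$ for $\theta > 0$, $\mathrm{sign}(\theta) = -1$ for $\theta < 0$, and $\mathrm{sign}(0) = 0$.
Furthermore, there is $M < \infty$ such that $\left| \int_{-T}^{T} [\sin(\theta t)/t] \, dt \right| \le M$ for all
$T > 0$ and $\theta \in \mathbf{R}$.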


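Proof (optional). When $\theta = 0$ both sides of (11.1.4) vanish, so assume
$\theta \ne 0$. Making the substitution $s = |\theta| t$, $dt = ds/|\theta|$ gives

$$\int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = \mathrm{sign}(\theta) \int_{-|\theta| T}^{|\theta| T} \frac{\sin s}{s} \, ds = 2 \, \mathrm{sign}(\theta) \int_{0}^{|\theta| T} \frac{\sin s}{s} \, ds, \qquad (11.1.5)$$

and hence

$$\lim_{T \to \infty} \int_{-T}^{T} \frac{\sin(\theta t)}{t} \, dt = 2 \, \mathrm{sign}(\theta) \int_0^\infty \frac{\sin s}{s} \, ds. \qquad (11.1.6)$$

Furthermore,

$$\int_0^\infty \frac{\sin s}{s} \, ds = \int_0^\infty (\sin s) \left( \int_0^\infty e^{-us} \, du \right) ds = \int_0^\infty \left( \int_0^\infty (\sin s) e^{-us} \, ds \right) du.$$

Now, for $u > 0$, integrating by parts twice,

$$I_u \equiv \int_0^\infty (\sin s) e^{-us} \, ds = (-\cos s) e^{-us} \Big|_{s=0}^{s=\infty} - \int_0^\infty (-\cos s)(-u) e^{-us} \, ds$$

$$= 0 - (-1) - u \left( (\sin s) e^{-us} \Big|_{s=0}^{s=\infty} - \int_0^\infty (\sin s)(-u) e^{-us} \, ds \right)$$

$$= 1 + 0 - 0 - u^2 \int_0^\infty (\sin s) e^{-us} \, ds.$$

Hence, $I_u = 1 - u^2 I_u$, so $I_u = 1/(1 + u^2)$.
We then compute that

$$\int_0^\infty \frac{\sin s}{s} \, ds = \int_0^\infty I_u \, du = \int_0^\infty \frac{1}{1+u^2} \, du = \arctan(u) \Big|_{u=0}^{u=\infty}$$

$$= \arctan(\infty) - \arctan(0) = \pi/2 - 0 = \pi/2.$$

Combining this with (11.1.6) gives (11.1.4).

Finally, since convergent sequences are bounded, it follows from (11.1.4)
that the set $\left\{ \int_{-T}^{T} [\sin(t)/t] \, dt \right\}_{T > 0}$ is bounded. It then follows from (11.1.5)
that the set $\left\{ \int_{-T}^{T} [\sin(\theta t)/t] \, dt \right\}_{T > 0, \, \theta \in \mathbf{R}}$ is bounded as well. ∎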
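
Both conclusions of Lemma 11.1.3 can be probed numerically. The sketch below is ours, not the text's: it approximates the integral with a plain midpoint rule (the step count is an arbitrary choice), so it only gives approximate values, and for finite $T$ the integral differs from $\pi \, \mathrm{sign}(\theta)$ by a term of order $1/(|\theta| T)$.

```python
import math

def sinc_integral(theta, T, steps=200_000):
    """Midpoint-rule approximation of the integral of sin(theta*t)/t over [-T, T].

    With an even number of steps no midpoint falls on t = 0, so the
    removable singularity is never evaluated.
    """
    h = 2.0 * T / steps
    total = 0.0
    for k in range(steps):
        t = -T + (k + 0.5) * h
        total += math.sin(theta * t) / t
    return total * h

def sign(theta):
    """sign(theta) as defined in Lemma 11.1.3."""
    return (theta > 0) - (theta < 0)

if __name__ == "__main__":
    for theta in (-2.0, 0.5, 3.0):
        print(theta, sinc_integral(theta, 200.0), math.pi * sign(theta))
```

The printed integrals should be close to $\pm\pi$ according to the sign of $\theta$.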


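Proof of Theorem 11.1.1. We compute that

$$\frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt$$

$$= \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \left( \int_{\mathbf{R}} e^{itx} \mu(dx) \right) dt \qquad \text{[by definition of } \phi(t)\text{]}$$

$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{e^{it(x-a)} - e^{it(x-b)}}{it} \, dt \, \mu(dx) \qquad \text{[by Fubini and Lemma 11.1.2]}$$

$$= \frac{1}{2\pi} \int_{\mathbf{R}} \int_{-T}^{T} \frac{\sin(t(x-a)) - \sin(t(x-b))}{t} \, dt \, \mu(dx) \qquad \text{[since } \cos(\theta t)/t \text{ is odd]}.$$


Hence, we may use Lemma 11.1.3, together with the bounded convergence
theorem and Remark 9.1.9, to conclude that

$$\lim_{T \to \infty} \frac{1}{2\pi} \int_{-T}^{T} \frac{e^{-ita} - e^{-itb}}{it} \, \phi(t) \, dt$$

$$= \frac{1}{2\pi} \int_{-\infty}^{\infty} \pi \left[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \right] \mu(dx)$$

$$= \int_{-\infty}^{\infty} \frac{1}{2} \left[ \mathrm{sign}(x-a) - \mathrm{sign}(x-b) \right] \mu(dx)$$

$$= \frac{1}{2} \mu\{a\} + \mu((a,b)) + \frac{1}{2} \mu\{b\}.$$


(The last equality follows because $(1/2)[\mathrm{sign}(x-a) - \mathrm{sign}(x-b)]$ is equal
to 0 if $x < a$ or $x > b$; is equal to 1/2 if $x = a$ or $x = b$; and is equal to 1 if
$a < x < b$.) But if $\mu\{a\} = \mu\{b\} = 0$, then this is precisely equal to $\mu([a,b])$,
as claimed. ∎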
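
As an informal check of the inversion formula, one can evaluate the truncated integral numerically for a distribution whose characteristic function is known in closed form. The sketch below assumes the standard normal case, $\phi(t) = e^{-t^2/2}$ (established later in the text as Proposition 11.2.1); the truncation level `T`, the step count, and the function names are our own choices, and the midpoint rule is only an approximation.

```python
import cmath
import math

def phi_normal(t):
    """Characteristic function of N(0,1), taken from Proposition 11.2.1."""
    return math.exp(-t * t / 2.0)

def inversion_integral(a, b, phi, T=12.0, steps=24_000):
    """Midpoint-rule value of (1/2pi) * int_{-T}^{T} (e^{-ita}-e^{-itb})/(it) phi(t) dt."""
    h = 2.0 * T / steps
    total = 0.0 + 0.0j
    for k in range(steps):
        t = -T + (k + 0.5) * h          # even step count: t is never 0
        kernel = (cmath.exp(-1j * t * a) - cmath.exp(-1j * t * b)) / (1j * t)
        total += kernel * phi(t)
    return (total * h / (2.0 * math.pi)).real

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

if __name__ == "__main__":
    a, b = -1.0, 2.0
    print(inversion_integral(a, b, phi_normal), normal_cdf(b) - normal_cdf(a))
```

The two printed numbers should agree closely, since the normal distribution has no atoms at $a$ or $b$.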


From this theorem easily follows the important

Corollary 11.1.7. (Fourier uniqueness theorem) Let $X$ and $Y$ be random
variables. Then $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$ if and only if $\mathcal{L}(X) = \mathcal{L}(Y)$,
i.e. if and only if $X$ and $Y$ have the same distribution.


Proof. Suppose $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. From the theorem, we know
that $P(a \le X \le b) = P(a \le Y \le b)$ provided that $P(X = a) = P(X =
b) = P(Y = a) = P(Y = b) = 0$, i.e. for all but countably many choices of
$a$ and $b$. But then by taking limits and using continuity of probabilities, we
see that $P(X \in I) = P(Y \in I)$ for all intervals $I \subseteq \mathbf{R}$. It then follows from
uniqueness of extensions (Proposition 2.5.8) that $\mathcal{L}(X) = \mathcal{L}(Y)$.

Conversely, if $\mathcal{L}(X) = \mathcal{L}(Y)$, then Corollary 6.1.3 implies that $E(e^{itX}) =
E(e^{itY})$, i.e. $\phi_X(t) = \phi_Y(t)$ for all $t \in \mathbf{R}$. ∎


This last result makes the continuity theorem at least plausible. How- 
ever, to prove the continuity theorem we require some further results. 


Lemma 11.1.8. (Helly Selection Principle) Let $\{F_n\}$ be a sequence
of cumulative distribution functions (i.e. $F_n(x) = \mu_n((-\infty,x])$ for some
probability distribution $\mu_n$). Then there is a subsequence $\{F_{n_k}\}$, and a
non-decreasing right-continuous function $F$ with $0 \le F \le 1$, such that
$\lim_{k \to \infty} F_{n_k}(x) = F(x)$ for all $x \in \mathbf{R}$ such that $F$ is continuous at $x$. [On
the other hand, we might not have $\lim_{x \to -\infty} F(x) = 0$ or $\lim_{x \to \infty} F(x) = 1$.]


Proof. Since the rationals are countable, we can write them as $\mathbf{Q} =
\{q_1, q_2, \ldots\}$. Since $0 \le F_n(q_1) \le 1$ for all $n$, the Bolzano-Weierstrass theorem
(see page 204) says there is at least one subsequence $\{\ell_k^{(1)}\}$ such that
$\lim_{k \to \infty} F_{\ell_k^{(1)}}(q_1)$ exists. Then, there is a further subsequence $\{\ell_k^{(2)}\}$ (i.e.,
$\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$) such that $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_2)$ exists (but
also $\lim_{k \to \infty} F_{\ell_k^{(2)}}(q_1)$ exists, since $\{\ell_k^{(2)}\}$ is a subsequence of $\{\ell_k^{(1)}\}$).
Continuing, for each $m \in \mathbf{N}$ there is a further subsequence $\{\ell_k^{(m)}\}$ such that
$\lim_{k \to \infty} F_{\ell_k^{(m)}}(q_j)$ exists for $j \le m$.

We now define the subsequence we want by $n_k = \ell_k^{(k)}$, i.e. we take the
$k$-th element of the $k$-th subsequence. (This trick is called the diagonalisation
method.) Since $\{n_k\}$ is a subsequence of $\{\ell_k^{(m)}\}$ from the $m$-th point onwards,
this ensures that $\lim_{k \to \infty} F_{n_k}(q) \equiv G(q)$ exists for each $q \in \mathbf{Q}$. Since each
$F_{n_k}$ is non-decreasing, therefore $G$ is also non-decreasing.

To continue, we set $F(x) = \inf\{G(q);\ q \in \mathbf{Q},\ q > x\}$. Then $F$ is easily
seen to be non-decreasing, with $0 \le F(x) \le 1$. Furthermore, $F$ is right-continuous,
since if $\{x_n\} \searrow x$ then $\{\{q \in \mathbf{Q} : q > x_n\}\} \nearrow \{q \in \mathbf{Q} :
q > x\}$, and hence $F(x_n) \to F(x)$ as in Exercise 3.6.4. Also, since $G$ is
non-decreasing, we have $F(q) \ge G(q)$ for all $q \in \mathbf{Q}$.

Now, if $F$ is continuous at $x$, then given $\epsilon > 0$ we can find rational
numbers $r$, $s$, and $u$ with $r < u < x < s$, and with $F(s) - F(r) < \epsilon$. We
then note that

$$F(x) - \epsilon \le F(r)$$
$$= \inf_{q > r} G(q)$$
$$= \inf_{q > r} \liminf_k F_{n_k}(q)$$
$$\le \liminf_k F_{n_k}(u) \qquad \text{since } u \in \mathbf{Q},\ u > r$$
$$\le \liminf_k F_{n_k}(x) \qquad \text{since } x > u$$
$$\le \limsup_k F_{n_k}(x)$$
$$\le \limsup_k F_{n_k}(s) \qquad \text{since } s > x$$
$$= G(s)$$
$$\le F(s)$$
$$\le F(x) + \epsilon.$$

This is true for any $\epsilon > 0$, hence we must have

$$\liminf_k F_{n_k}(x) = \limsup_k F_{n_k}(x) = F(x),$$

so that $\lim_k F_{n_k}(x) = F(x)$, as claimed. ∎


Unfortunately, Lemma 11.1.8 does not ensure that $\lim_{x \to \infty} F(x) = 1$ or
$\lim_{x \to -\infty} F(x) = 0$ (see e.g. Exercise 11.5.1). To rectify this, we require a
new notion. We say that a collection $\{\mu_n\}$ of probability measures on $\mathbf{R}$ is
tight if for all $\epsilon > 0$, there are $a < b$ with $\mu_n([a,b]) \ge 1 - \epsilon$ for all $n$. That
is, all of the measures give most of their mass to the same finite interval;
mass does not "escape off to infinity".


Exercise 11.1.9. Prove that: 

(a) any finite collection of probability measures is tight. 

(b) the union of two tight collections of probability measures is tight. 
(c) any sub-collection of a tight collection is tight. 


We then have the following. 


Theorem 11.1.10. If $\{\mu_n\}$ is a tight sequence of probability measures,
then there is a subsequence $\{\mu_{n_k}\}$ and a probability measure $\mu$, such that
$\mu_{n_k} \Rightarrow \mu$, i.e. $\{\mu_{n_k}\}$ converges weakly to $\mu$.

Proof. Let $F_n(x) = \mu_n((-\infty,x])$. Then by Lemma 11.1.8, there is a
subsequence $F_{n_k}$ and a function $F$ such that $F_{n_k}(x) \to F(x)$ at all continuity
points of $F$. Furthermore $0 \le F \le 1$.

We now claim that $F$ is actually a probability distribution function, i.e.
that $\lim_{x \to -\infty} F(x) = 0$ and $\lim_{x \to \infty} F(x) = 1$. Indeed, let $\epsilon > 0$. Then
using tightness, we can find points $a < b$ which are continuity points of $F$,
such that $\mu_n((a,b]) \ge 1 - \epsilon$ for all $n$. But then

$$\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) \ge F(b) - F(a)$$

$$= \lim_{k} [F_{n_k}(b) - F_{n_k}(a)] = \lim_{k} \mu_{n_k}((a,b]) \ge 1 - \epsilon.$$

This is true for all $\epsilon > 0$, so we must have $\lim_{x \to \infty} F(x) - \lim_{x \to -\infty} F(x) = 1$,
proving the claim.

Hence, $F$ is indeed a probability distribution function. Thus, we can
define the probability measure $\mu$ by $\mu((a,b]) = F(b) - F(a)$ for $a < b$. Then
$\mu_{n_k} \Rightarrow \mu$ by Theorem 10.1.1, and we are done. ∎


A main use of this theorem comes from the following corollary. 


Corollary 11.1.11. Let $\{\mu_n\}$ be a tight sequence of probability distributions
on $\mathbf{R}$. Suppose that $\mu$ is the only possible weak limit of $\{\mu_n\}$, in the
sense that whenever $\mu_{n_k} \Rightarrow \nu$ then $\nu = \mu$ (that is, whenever a subsequence
of the $\{\mu_n\}$ converges weakly to some probability measure, then that probability
measure must be $\mu$). Then $\mu_n \Rightarrow \mu$, i.e. the full sequence converges
weakly to $\mu$.


Proof. If $\mu_n \not\Rightarrow \mu$, then by Theorem 10.1.1, it is not the case that
$\mu_n((-\infty,x]) \to \mu((-\infty,x])$ for all $x \in \mathbf{R}$ with $\mu\{x\} = 0$. Hence, we can find
$x \in \mathbf{R}$, $\epsilon > 0$, and a subsequence $\{n_k\}$, with $\mu\{x\} = 0$, but with

$$|\mu_{n_k}((-\infty,x]) - \mu((-\infty,x])| \ge \epsilon, \qquad k \in \mathbf{N}. \qquad (11.1.12)$$


On the other hand, $\{\mu_{n_k}\}$ is a subcollection of $\{\mu_n\}$ and hence tight, so
by Theorem 11.1.10 there is a further subsequence $\{\mu_{n_{k_j}}\}$ which converges
weakly to some probability measure, say $\nu$. But then by hypothesis we must
have $\nu = \mu$, which is a contradiction to (11.1.12). ∎


Corollary 11.1.11 is nearly the last thing we need to prove the continuity 
theorem. We require just one further result, concerning a sufficient condition 
for a sequence of measures to be tight. 


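Lemma 11.1.13. Let $\{\mu_n\}$ be a sequence of probability measures on
$\mathbf{R}$, with characteristic functions $\phi_n(t) = \int e^{itx} \mu_n(dx)$. Suppose there is a
function $g$ which is continuous at 0, such that $\lim_n \phi_n(t) = g(t)$ for each
$|t| < t_0$ for some $t_0 > 0$. Then $\{\mu_n\}$ is tight.


Proof. We first note that $g(0) = \lim_n \phi_n(0) = \lim_n 1 = 1$. We then
compute that, for $y > 0$,

$$\mu_n((-\infty, -\tfrac{2}{y}] \cup [\tfrac{2}{y}, \infty)) = \int_{|x| \ge 2/y} 1 \, \mu_n(dx)$$

$$\le 2 \int_{|x| \ge 2/y} \left( 1 - \frac{1}{y|x|} \right) \mu_n(dx)$$

$$\le 2 \int_{|x| \ge 2/y} \left( 1 - \frac{\sin(yx)}{yx} \right) \mu_n(dx)$$

$$= \int_{|x| \ge 2/y} \frac{1}{y} \int_{-y}^{y} (1 - \cos(tx)) \, dt \, \mu_n(dx)$$

$$\le \int_{x \in \mathbf{R}} \frac{1}{y} \int_{-y}^{y} (1 - \cos(tx)) \, dt \, \mu_n(dx)$$

$$= \int_{x \in \mathbf{R}} \frac{1}{y} \int_{-y}^{y} (1 - e^{itx}) \, dt \, \mu_n(dx)$$

$$= \frac{1}{y} \int_{-y}^{y} (1 - \phi_n(t)) \, dt.$$

Here the first inequality uses that $1 - \frac{1}{y|x|} \ge \frac{1}{2}$ whenever $|x| \ge 2/y$, the
second inequality uses that $\frac{\sin(yx)}{yx} \le \frac{1}{y|x|}$, the second equality uses
that $\int_{-y}^{y} \cos(tx) \, dt = \frac{2 \sin(yx)}{x}$, the final inequality uses that $1 - \cos(tx) \ge 0$,
the third equality uses that $\int_{-y}^{y} \sin(tx) \, dt = 0$, and the final equality uses
Fubini's theorem (which is justified since the function is bounded and hence
has finite double-integral).

To finish the proof, let $\epsilon > 0$. Since $g(0) = 1$ and $g$ is continuous at
0, we can find $y_0$ with $0 < y_0 < t_0$ such that $|1 - g(t)| \le \epsilon/4$ whenever
$|t| \le y_0$. Then $\left| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - g(t)) \, dt \right| \le \epsilon/2$. Now, $\phi_n(t) \to g(t)$ for all
$|t| \le y_0$, and $|\phi_n(t)| \le 1$. Hence, by the bounded convergence theorem, we
can find $n_0 \in \mathbf{N}$ such that $\left| \frac{1}{y_0} \int_{-y_0}^{y_0} (1 - \phi_n(t)) \, dt \right| \le \epsilon$ for all $n \ge n_0$.

Hence, $\mu_n((-\frac{2}{y_0}, \frac{2}{y_0})) = 1 - \mu_n((-\infty, -\frac{2}{y_0}] \cup [\frac{2}{y_0}, \infty)) \ge 1 - \epsilon$ for all
$n \ge n_0$. It follows from the definition that $\{\mu_n\}$ is tight. ∎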
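
The key tail estimate in this proof, $\mu((-\infty, -2/y] \cup [2/y, \infty)) \le \frac{1}{y} \int_{-y}^{y} (1 - \phi(t)) \, dt$, can be checked in closed form for a single measure. The sketch below assumes $\mu = N(0,1)$ with $\phi(t) = e^{-t^2/2}$ (Proposition 11.2.1); both sides are then expressible through the error function. The function names are ours.

```python
import math

def normal_tail_mass(y):
    """mu((-inf, -2/y] U [2/y, inf)) for mu = N(0,1), via 2(1 - Phi(z)) = erfc(z / sqrt(2))."""
    return math.erfc((2.0 / y) / math.sqrt(2.0))

def cf_bound(y):
    """(1/y) * int_{-y}^{y} (1 - exp(-t^2/2)) dt, using int exp(-t^2/2) = sqrt(2 pi) erf(y / sqrt 2)."""
    gaussian_part = math.sqrt(2.0 * math.pi) * math.erf(y / math.sqrt(2.0))
    return (2.0 * y - gaussian_part) / y

if __name__ == "__main__":
    for y in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(y, normal_tail_mass(y), cf_bound(y))
```

In each printed row the tail mass should not exceed the characteristic-function bound.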


We are now, finally, in a position to prove the continuity theorem. 


Theorem 11.1.14. (Continuity Theorem) Let $\mu, \mu_1, \mu_2, \ldots$ be probability
measures, with corresponding characteristic functions $\phi, \phi_1, \phi_2, \ldots$.
Then $\mu_n \Rightarrow \mu$ if and only if $\phi_n(t) \to \phi(t)$ for all $t \in \mathbf{R}$. In words, the probability
measures $\{\mu_n\}$ converge weakly to $\mu$ if and only if their characteristic
functions converge pointwise to that of $\mu$.


Proof. First, suppose that $\mu_n \Rightarrow \mu$. Then, since $\cos(tx)$ and $\sin(tx)$ are
bounded continuous functions, we have as $n \to \infty$ for each $t \in \mathbf{R}$ that

$$\phi_n(t) = \int \cos(tx) \, \mu_n(dx) + i \int \sin(tx) \, \mu_n(dx)$$
$$\to \int \cos(tx) \, \mu(dx) + i \int \sin(tx) \, \mu(dx)$$
$$= \phi(t).$$

Conversely, suppose that $\phi_n(t) \to \phi(t)$ for each $t \in \mathbf{R}$. Then by Lemma
11.1.13 (with $g = \phi$), the $\{\mu_n\}$ are tight. Now, suppose that we have $\mu_{n_k} \Rightarrow
\nu$ for some subsequence $\{\mu_{n_k}\}$ and some measure $\nu$. Then, from the previous
paragraph we must have $\phi_{n_k}(t) \to \phi_\nu(t)$ for all $t$, where $\phi_\nu(t) = \int e^{itx} \nu(dx)$.
On the other hand, we know that $\phi_{n_k}(t) \to \phi(t)$ for all $t$; hence, we must
have $\phi_\nu = \phi$. But from Fourier uniqueness (Corollary 11.1.7), this implies
that $\nu = \mu$.

Hence, we have shown that $\mu$ is the only possible weak limit of the $\{\mu_n\}$.
Therefore, from Corollary 11.1.11, we must have $\mu_n \Rightarrow \mu$, as claimed. ∎


11.2. The Central Limit Theorem. 


Now that we have proved the continuity theorem (Theorem 11.1.14), it 
is very easy to prove the classical central limit theorem. 

First, we compute the characteristic function for the standard normal
distribution $N(0,1)$, i.e. for a random variable $X$ having density with respect
to Lebesgue measure given by $f_X(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. That is, we wish to
compute

$$\phi_X(t) = \int_{-\infty}^{\infty} e^{itx} \, \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \, dx.$$

Comparing with the computation leading to (9.3.2), we might expect that
$\phi_X(t) = M_X(it) = e^{(it)^2/2} = e^{-t^2/2}$. This is in fact correct, and can be
justified using theory of complex analysis. But to avoid such technicalities,
we instead resort to a trick.
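
Before the formal proof, the expected formula $\phi_X(t) = e^{-t^2/2}$ can be sanity-checked by numerical integration. This is only a sketch: the truncation of the real line to $[-L, L]$, the midpoint rule, and the function name are our own choices. Only the cosine part is integrated, since the sine part vanishes by symmetry of the density.

```python
import math

def normal_cf_numeric(t, L=12.0, steps=24_000):
    """Midpoint-rule approximation of E[cos(tX)] for X ~ N(0,1), truncated to [-L, L]."""
    h = 2.0 * L / steps
    total = 0.0
    for k in range(steps):
        x = -L + (k + 0.5) * h
        total += math.cos(t * x) * math.exp(-x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

if __name__ == "__main__":
    for t in (0.0, 0.5, 1.0, 3.0):
        print(t, normal_cf_numeric(t), math.exp(-t * t / 2.0))
```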


Proposition 11.2.1. If $X \sim N(0,1)$, then $\phi_X(t) = e^{-t^2/2}$ for all $t \in \mathbf{R}$.


Proof. By Proposition 9.2.1 (with $F_t = e^{itX}$ and $Y = |X|$, so that
$E(Y) < \infty$ and $|F_t'| = |(iX) e^{itX}| = |X| \le Y$ for all $t$), we can differentiate
under the integral sign, to obtain that

$$\phi_X'(t) = \int_{-\infty}^{\infty} i x \, e^{itx} \, \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \, dx = \int_{-\infty}^{\infty} i \, e^{itx} \, \frac{1}{\sqrt{2\pi}} \, x e^{-x^2/2} \, dx.$$

Integrating by parts gives that

$$\phi_X'(t) = \int_{-\infty}^{\infty} i (it) \, e^{itx} \, \frac{1}{\sqrt{2\pi}} \, e^{-x^2/2} \, dx = -t \, \phi_X(t).$$

Hence, $\phi_X'(t) = -t \, \phi_X(t)$, so that $\frac{d}{dt} \log \phi_X(t) = -t$. Also, we know that
$\log \phi_X(0) = \log 1 = 0$. Hence, we must have $\log \phi_X(t) = \int_0^t (-s) \, ds = -t^2/2$,
whence $\phi_X(t) = e^{-t^2/2}$. ∎
We can now prove 


Theorem 11.2.2. (Central Limit Theorem) Let $X_1, X_2, \ldots$ be i.i.d.
with finite mean $m$ and finite variance $v$. Set $S_n = X_1 + \ldots + X_n$. Then as
$n \to \infty$,

$$\mathcal{L}\left( \frac{S_n - nm}{\sqrt{vn}} \right) \Rightarrow \mu_N,$$

where $\mu_N = N(0,1)$ is the standard normal distribution, i.e. the distribution
having density $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ with respect to Lebesgue measure.

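Proof. By replacing $X_i$ by $(X_i - m)/\sqrt{v}$, we can (and do) assume that $m = 0$
and $v = 1$.

Let $\phi_n(t) = E(e^{itS_n/\sqrt{n}})$ be the characteristic function of $S_n/\sqrt{n}$. By
the continuity theorem (Theorem 11.1.14), and by Proposition 11.2.1, it
suffices to show that $\lim_n \phi_n(t) = e^{-t^2/2}$ for each fixed $t \in \mathbf{R}$.

To this end, set $\phi(t) = E(e^{itX_1})$. Then as $n \to \infty$, using a Taylor
expansion and Proposition 11.0.1,

$$\phi_n(t) = E\left( e^{it(X_1 + \ldots + X_n)/\sqrt{n}} \right)$$
$$= \left( \phi(t/\sqrt{n}) \right)^n$$
$$= \left( 1 + \frac{it}{\sqrt{n}} E(X_1) + \frac{1}{2} \left( \frac{it}{\sqrt{n}} \right)^2 E\left( (X_1)^2 \right) + o(1/n) \right)^n$$
$$= \left( 1 - \frac{t^2}{2n} + o(1/n) \right)^n$$
$$\to e^{-t^2/2},$$

as claimed. (Here $o(1/n)$ means a quantity $q_n$ such that $q_n/(1/n) \to 0$ as
$n \to \infty$. Formally, the limit holds since for any $\epsilon > 0$, for sufficiently large
$n$ we have $q_n \ge -\epsilon/n$ and also $q_n \le \epsilon/n$, so that the lim inf is $\ge e^{-(t^2/2) - \epsilon}$
and the lim sup is $\le e^{-(t^2/2) + \epsilon}$.) ∎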
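
The convergence $(\phi(t/\sqrt{n}))^n \to e^{-t^2/2}$ can be watched directly in a case where $\phi$ is explicit. The sketch below assumes fair coin flips $X_i = \pm 1$ (mean 0, variance 1), for which $\phi(t) = \cos t$; this choice of distribution and the function names are ours, not the text's.

```python
import math

def cf_normalised_sum(t, n):
    """Characteristic function of S_n / sqrt(n) for i.i.d. +/-1 coin flips: cos(t/sqrt(n))^n."""
    return math.cos(t / math.sqrt(n)) ** n

def cf_standard_normal(t):
    """Characteristic function of N(0,1) (Proposition 11.2.1)."""
    return math.exp(-t * t / 2.0)

if __name__ == "__main__":
    t = 2.0
    for n in (10, 100, 10_000, 1_000_000):
        print(n, cf_normalised_sum(t, n), cf_standard_normal(t))
```

As n grows, the first printed column should approach the second.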


Since the normal distribution has no points of positive measure, this 
theorem immediately implies (by Theorem 10.1.1) the simpler-seeming 


Corollary 11.2.3. Let $\{X_n\}$, $m$, $v$, and $S_n$ be as above. Then for each
fixed $x \in \mathbf{R}$,

$$\lim_{n \to \infty} P\left( \frac{S_n - nm}{\sqrt{nv}} \le x \right) = \Phi(x), \qquad (11.2.4)$$

where $\Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} \, dt$ is the cumulative distribution function for
the standard normal distribution.

This can also be written as $P(S_n \le nm + x\sqrt{nv}) \to \Phi(x)$. That is,
$X_1 + \ldots + X_n$ is approximately equal to $nm$, with deviations from this value
of order $\sqrt{n}$. For example, suppose $X_1, X_2, \ldots$ each have the Poisson(5)
distribution. This implies that $m = E(X_i) = 5$ and $v = \mathrm{Var}(X_i) = 5$.
Hence, for each fixed $x \in \mathbf{R}$, we see that

$$P\left( X_1 + \ldots + X_n \le 5n + x\sqrt{5n} \right) \to \Phi(x), \qquad n \to \infty.$$
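
The Poisson(5) example can be examined without simulation, using the standard fact (not stated in the text above) that a sum of $n$ independent Poisson(5) variables has the Poisson($5n$) distribution. The following sketch computes the left side exactly from the Poisson probability mass function and compares it with $\Phi(x)$; the function names and the choice $n = 400$ are ours.

```python
import math

def poisson_cdf(k, lam):
    """P(N <= k) for N ~ Poisson(lam), summing the pmf computed in log space."""
    if k < 0:
        return 0.0
    return sum(math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1))
               for j in range(int(k) + 1))

def normal_cdf(x):
    """Standard normal cumulative distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def clt_probability(n, x):
    """P(X_1 + ... + X_n <= 5n + x sqrt(5n)) for i.i.d. Poisson(5) summands."""
    threshold = 5.0 * n + x * math.sqrt(5.0 * n)
    return poisson_cdf(math.floor(threshold), 5.0 * n)

if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.5):
        print(x, clt_probability(400, x), normal_cdf(x))
```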


Remarks.
1. It is not essential in the Central Limit Theorem to divide by $\sqrt{v}$. Without
doing so, the theorem asserts instead that

$$\mathcal{L}\left( \frac{S_n - nm}{\sqrt{n}} \right) \Rightarrow N(0, v).$$


2. Going backwards, the Central Limit Theorem in turn implies the WLLN,
since if $y > m$, then as $n \to \infty$,

$$P[(S_n/n) \le y] = P[S_n \le ny] = P[(S_n - nm)/\sqrt{nv} \le (ny - nm)/\sqrt{nv}]$$

$$\approx \Phi[(ny - nm)/\sqrt{nv}] = \Phi[\sqrt{n}(y - m)/\sqrt{v}] \to \Phi(+\infty) = 1,$$

and similarly if $y < m$ then $P[(S_n/n) \le y] \to \Phi(-\infty) = 0$. Hence,
$\mathcal{L}(S_n/n) \Rightarrow \delta_m$, and so $S_n/n$ converges to $m$ in probability.


11.3. Generalisations of the Central Limit Theorem. 


The classical central limit theorem (Theorem 11.2.2 and Corollary 11.2.3) 
is extremely useful in many areas of science. However, it does have certain 
limitations. For example, it provides no quantitative bounds on the conver- 
gence in (11.2.4). Also, the insistence that the random variables be i.i.d. is 
sometimes too severe. 

The first of these problems is solved by the Berry-Esseen Theorem, which
states that if $X_1, X_2, \ldots$ are i.i.d. with finite mean $m$, finite positive variance
$v$, and $E(|X_i - m|^3) = \rho < \infty$, then

$$\left| P\left( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \le x \right) - \Phi(x) \right| \le \frac{3\rho}{\sqrt{v^3 n}}.$$


This theorem thus provides a quantitative bound on the convergence in
(11.2.4), depending only on the third moment. For a proof see e.g. Feller
(1971, Section XVI.5). Note, however, that this error bound is absolute, not
relative: as $x \to -\infty$, both $P\left( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \le x \right)$ and $\Phi(x)$ get small, and
the Berry-Esseen Theorem says less and less. In particular, the theorem
does not assert that $P\left( \frac{X_1 + \ldots + X_n - nm}{\sqrt{vn}} \le x \right)$ decreases as $O(e^{-x^2/2})$ as $x \to
-\infty$, even though $\Phi(x)$ does. (We have already seen in Theorem 9.3.4 that
$P\left( \frac{X_1 + \ldots + X_n}{n} \le x \right)$ often decreases as $\rho^n$ for some $\rho < 1$.)

Regarding the second problem, we mention just two of many results.
To state them, we shall consider collections $\{Z_{nk};\ n \ge 1,\ 1 \le k \le r_n\}$ of
random variables such that each row $\{Z_{nk}\}_{1 \le k \le r_n}$ is independent, called
triangular arrays. (If $r_n = n$ they form an actual triangle.) We shall
assume for simplicity that $E(Z_{nk}) = 0$ for each $n$ and $k$. We shall further
set $\sigma_{nk}^2 = E(Z_{nk}^2)$ (assumed to be finite), $S_n = Z_{n1} + \ldots + Z_{nr_n}$, and
$s_n^2 = \mathrm{Var}(S_n) = \sigma_{n1}^2 + \ldots + \sigma_{nr_n}^2$.

For such a triangular array, the Lindeberg Central Limit Theorem states
that $\mathcal{L}(S_n/s_n) \Rightarrow N(0,1)$, provided that for each $\epsilon > 0$, we have

$$\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^{r_n} E\left[ Z_{nk}^2 \, \mathbf{1}_{|Z_{nk}| \ge \epsilon s_n} \right] = 0. \qquad (11.3.1)$$


This Lindeberg condition states, roughly, that as $n \to \infty$, the tails of the
$Z_{nk}$ contribute less and less to the variance of $S_n$.
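
To make the Lindeberg condition concrete, here is a sketch for one simple array of our own choosing (not from the text): $r_n = n$ and $Z_{nk} = Y_k/\sqrt{n}$ with $Y_k = \pm 1$ fair coin flips. Then $|Z_{nk}| = 1/\sqrt{n}$ is constant, $s_n = 1$, and the left side of (11.3.1) is either 1 or 0 depending on whether $1/\sqrt{n} \ge \epsilon$.

```python
import math

def lindeberg_sum_coin_flips(n, eps):
    """Left side of (11.3.1), before the limit, for Z_nk = Y_k / sqrt(n), Y_k = +/-1."""
    z_abs = 1.0 / math.sqrt(n)      # |Z_nk| takes this single value
    s_n = 1.0                       # Var(S_n) = n * (1/n) = 1
    # E[Z_nk^2 1{|Z_nk| >= eps s_n}] equals z_abs^2 if the indicator is on, else 0.
    per_term = z_abs ** 2 if z_abs >= eps * s_n else 0.0
    return n * per_term / s_n ** 2

if __name__ == "__main__":
    for n in (4, 25, 400, 10_000):
        print(n, lindeberg_sum_coin_flips(n, eps=0.1))
```

For fixed $\epsilon$ the sum is exactly 0 once $n > 1/\epsilon^2$, so the condition holds for this array.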


Exercise 11.3.2. Consider the special case where $r_n = n$, with $Z_{nk} =
\frac{1}{\sqrt{vn}} Y_k$ where $\{Y_k\}$ are i.i.d. with mean 0 and variance $v < \infty$ (so $s_n = 1$).

(a) Prove that the Lindeberg condition (11.3.1) is satisfied in this case.
[Hint: Use the Dominated Convergence Theorem.]

(b) Prove that the Lindeberg CLT implies Theorem 11.2.2.


This raises the question that, if (11.3.1) is not satisfied, then what other 
limiting distributions may arise? Call a distribution u a possible limit if 
there exists a triangular array as defined above, with sup, s2 < oo and 
limpoo Maxi <k<r, 0 of. = = 0 (so that no one term dominates the contribu- 
tion to Var(S;,,)), such that £(S,,) = u. Then we can ask, what distribu- 
tions are possible limits? Obviously the normal distributions N(0,v) are 
possible limits; indeed £(S,) = N(0,v) whenever (11.3.1) is satisfied and 
s2 — v. But what else? 

The answer is that the possible limits are precisely the infinitely divisible
distributions having mean 0 and finite variance. Here a distribution $\mu$ is
called infinitely divisible if for all $n \in \mathbf{N}$, there is a distribution $\nu_n$ such that
the $n$-fold convolution of $\nu_n$ equals $\mu$ (in symbols: $\nu_n * \nu_n * \ldots * \nu_n = \mu$).
Recall that this means that, if $X_1, X_2, \ldots, X_n \sim \nu_n$ are independent, then
$X_1 + \ldots + X_n \sim \mu$.

Half of this theorem is obvious; indeed, if $\mu$ is infinitely divisible, then
we can take $r_n = n$ and $\mathcal{L}(Z_{nk}) = \nu_n$ in the triangular array, to get that
$\mathcal{L}(S_n) \Rightarrow \mu$. For a proof of the converse, see e.g. Billingsley (1995, Theorem
28.2).
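
To make the definition of infinite divisibility concrete, here is a small Python sketch that checks numerically that the $n$-fold convolution of Poisson($\lambda/n$) agrees with Poisson($\lambda$) on the first few integers. The helper names and the values $\lambda = 3$, $n = 5$ are assumptions made for illustration only, and a numerical check is of course not a proof.

```python
from math import exp, factorial

def poisson_pmf(lam, size):
    """First `size` probabilities P(X = 0), ..., P(X = size-1) of Poisson(lam)."""
    return [exp(-lam) * lam**k / factorial(k) for k in range(size)]

def convolve(p, q):
    """Truncated convolution: entry k is sum_{j<=k} p[j]*q[k-j], i.e. the pmf of
    a sum of two independent non-negative integer variables, for k < len(p)."""
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(len(p))]

def n_fold_convolution(p, n):
    """pmf (truncated) of the sum of n independent copies with pmf p."""
    result = p
    for _ in range(n - 1):
        result = convolve(result, p)
    return result

lam, n, size = 3.0, 5, 12
nu_n = poisson_pmf(lam / n, size)   # candidate nu_n = Poisson(lam/n)
mu = poisson_pmf(lam, size)         # target mu = Poisson(lam)
max_gap = max(abs(a - b) for a, b in zip(n_fold_convolution(nu_n, n), mu))
print(max_gap)                      # tiny, of the order of rounding error
```

The truncation is harmless here because entry $k$ of a convolution of distributions on the non-negative integers depends only on entries $0, \ldots, k$ of the factors.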


11.4. Method of moments. 


There is another way of proving weak convergence of probability mea- 
sures, which does not explicitly use characteristic functions or the continuity 
theorem (though its proof of correctness does, through Corollary 11.1.11). 
Instead, it uses moments, as we now discuss. 

Recall that a probability distribution $\mu$ on $\mathbf{R}$ has moments defined by
$\alpha_k = \int x^k \mu(dx)$, for $k = 1,2,3,\ldots$. Suppose these moments all exist and are
all finite. Then is $\mu$ the only distribution having precisely these moments?
And, if a sequence $\{\mu_n\}$ of distributions has moments which converge to
those of $\mu$, then does it follow that $\mu_n \Rightarrow \mu$, i.e. that the $\mu_n$ converge weakly
to $\mu$? We shall see in this section that such conclusions hold sometimes,
but not always.

We shall say that a distribution $\mu$ is determined by its moments if all
its moments are finite, and if no other distribution has identical moments.
(That is, we have $\int |x^k| \mu(dx) < \infty$ for all $k \in \mathbf{N}$, and furthermore whenever
$\int x^k \mu(dx) = \int x^k \nu(dx)$ for all $k \in \mathbf{N}$, then we must have $\nu = \mu$.)

We first show that, for those distributions determined by their moments, 
convergence of moments implies weak convergence of distributions; this re- 
sult thus reduces the second question above to the first question. 


Theorem 11.4.1. Suppose that $\mu$ is determined by its moments. Let
$\{\mu_n\}$ be a sequence of distributions, such that $\int x^k \mu_n(dx)$ is finite for all
$n, k \in \mathbf{N}$, and such that $\lim_{n\to\infty} \int x^k \mu_n(dx) = \int x^k \mu(dx)$ for each $k \in \mathbf{N}$.
Then $\mu_n \Rightarrow \mu$, i.e. the $\mu_n$ converge weakly to $\mu$.


Proof. We first claim that $\{\mu_n\}$ is tight. Indeed, since the moments
converge to finite quantities, we can find $K_k \in \mathbf{R}$ with $\int x^k \mu_n(dx) < K_k$
for all $n \in \mathbf{N}$. But then, by Markov's inequality, letting $Y_n \sim \mu_n$, we have

$$\begin{aligned}
\mu_n([-R,R]) &= P(|Y_n| \le R) \\
&\ge 1 - P(|Y_n| \ge R) \\
&= 1 - P(Y_n^2 \ge R^2) \\
&\ge 1 - (E[Y_n^2] / R^2) \\
&\ge 1 - (K_2 / R^2),
\end{aligned}$$

which is $\ge 1 - \epsilon$ whenever $R \ge \sqrt{K_2/\epsilon}$, thus proving tightness.

We now claim that if any subsequence $\{\mu_{n_r}\}$ converges weakly to some
distribution $\nu$, then we must have $\nu = \mu$. The theorem will then follow from
Corollary 11.1.11.

Indeed, suppose $\mu_{n_r} \Rightarrow \nu$. By Skorohod's theorem, we can find random
variables $Y$ and $\{Y_r\}$ with $\mathcal{L}(Y) = \nu$ and $\mathcal{L}(Y_r) = \mu_{n_r}$, such that
$Y_r \to Y$ with probability 1. But then also $Y_r^k \to Y^k$ with probability 1.
Furthermore, for $k \in \mathbf{N}$ and $\alpha > 0$, we have

$$E\left( |Y_r|^k \, 1_{|Y_r|^k \ge \alpha} \right) \le E\left( \frac{|Y_r|^{2k}}{\alpha} \, 1_{|Y_r|^k \ge \alpha} \right) \le \frac{1}{\alpha} E\left( |Y_r|^{2k} \right) = \frac{1}{\alpha} E\left( (Y_r)^{2k} \right) \le \frac{K_{2k}}{\alpha},$$

which is independent of $r$, and goes to 0 as $\alpha \to \infty$. Hence, the $\{Y_r^k\}$
are uniformly integrable. Thus, by Theorem 9.1.6, $E(Y_r^k) \to E(Y^k)$, i.e.
$\int x^k \mu_{n_r}(dx) \to \int x^k \nu(dx)$.

But we already know that $\int x^k \mu_{n_r}(dx) \to \int x^k \mu(dx)$. Hence, the
moments of $\nu$ and $\mu$ must coincide. And, since $\mu$ is determined by its moments,
we must have $\nu = \mu$, as claimed. ∎


This theorem leads to the question of which distributions are determined 
by their moments. Unfortunately, not all distributions are, as the following 
exercise shows. 


Exercise 11.4.2. Let $f(x) = \frac{1}{x\sqrt{2\pi}} e^{-(\log x)^2/2}$ for $x > 0$ (with $f(x) = 0$
for $x \le 0$) be the density function for the random variable $e^X$, where
$X \sim N(0,1)$. Let $g(x) = f(x)\,(1 + \sin(2\pi \log x))$. Show that $g(x) \ge 0$
and that $\int x^k g(x)\,dx = \int x^k f(x)\,dx$ for $k = 0,1,2,\ldots$. [Hint: Consider
$\int x^k f(x) \sin(2\pi \log x)\,dx$, and make the substitution $x = e^s e^k$, $dx = e^s e^k ds$.]
Show further that $\int |x|^k f(x)\,dx < \infty$ for all $k \in \mathbf{N}$. Conclude that $g$ is
a probability density function, and that $g$ gives rise to the same (finite)
moments as does $f$. Relate this to Theorem 11.4.1 above and Theorem
11.4.3 below.
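
Before attempting the proof, one may wish to see the claim of Exercise 11.4.2 numerically. The Python sketch below (an illustration only; the function `integral`, the step size and the integration window are arbitrary choices made here) applies the trapezoid rule after the substitution $x = e^s$, under which $x^k f(x)\,dx$ becomes $e^{ks - s^2/2}/\sqrt{2\pi}\;ds$ and $\sin(2\pi \log x)$ becomes $\sin(2\pi s)$.

```python
from math import exp, sin, pi

def integral(k, perturbed, half_width=12.0, step=0.001):
    """Trapezoid-rule approximation of  int x^k f(x) w(x) dx  after the
    substitution x = e^s, where w = 1 (perturbed=False) or
    w(x) = sin(2*pi*log x) (perturbed=True).  In the variable s the
    integrand is exp(k*s - s^2/2)/sqrt(2*pi) * w, concentrated near s = k."""
    m = int(round(2 * half_width / step))
    total = 0.0
    for i in range(m + 1):
        s = k - half_width + i * step
        val = exp(k * s - s * s / 2) / (2 * pi) ** 0.5
        if perturbed:
            val *= sin(2 * pi * s)
        total += val * (0.5 if i in (0, m) else 1.0)   # endpoint weights 1/2
    return total * step

for k in range(4):
    moment = integral(k, perturbed=False)   # k-th moment of f
    extra = integral(k, perturbed=True)     # contribution of the sine term
    print(k, moment, extra / moment)
```

In each row the second number should be close to $e^{k^2/2}$ and the third close to 0, which is consistent with (but does not prove) the equality of the moments of $f$ and $g$.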


On the other hand, if a distribution satisfies that its moment generating 
function is finite in a neighbourhood of the origin, then it will be determined 
by its moments, as we now show. (Unfortunately, the proof requires a result 
from complex function theory.) 


Theorem 11.4.3. Let $s_0 > 0$, and let $X$ be a random variable with
moment generating function $M_X(s)$ which is finite for $|s| < s_0$. Then $\mathcal{L}(X)$
is determined by its moments (and also by $M_X(s)$).


Proof (optional). Let $f_X(z) = E(e^{zX})$ for $z \in \mathbf{C}$. Since $|e^{zX}| = e^{X \,\mathrm{Re}\, z}$,
we see that $f_X(z)$ is finite whenever $|\mathrm{Re}\, z| < s_0$. Furthermore, just like for
$M_X(s)$, it follows that $f_X(z)$ is analytic on $\{z \in \mathbf{C};\ |\mathrm{Re}\, z| < s_0\}$. Now, if
$Y$ has the same moments as does $X$, then for $|s| < s_0$, we have by
order-preserving and countable linearity that

$$\begin{aligned}
E\left(e^{|sY|}\right) &\le E\left(e^{sY} + e^{-sY}\right) \\
&= 2 E\left(1 + \frac{(sY)^2}{2!} + \ldots\right) = 2\left(1 + \frac{s^2 E(Y^2)}{2!} + \ldots\right) \\
&= 2\left(1 + \frac{s^2 E(X^2)}{2!} + \ldots\right) = M_X(s) + M_X(-s) < \infty.
\end{aligned}$$

Hence, $M_Y(s) < \infty$ for $|s| < s_0$. It now follows from Theorem 9.3.3 that
$M_Y(s) = M_X(s)$ for $|s| < s_0$, i.e. that $f_X(s) = f_Y(s)$ for real $|s| < s_0$. By
the uniqueness of analytic continuation, this implies that $f_Y(z) = f_X(z)$
for $|\mathrm{Re}\, z| < s_0$. In particular, since $\phi_X(t) = f_X(it)$ and $\phi_Y(t) = f_Y(it)$,
we have $\phi_X = \phi_Y$. Hence, by the uniqueness theorem for characteristic
functions (Theorem 11.1.7), we must have $\mathcal{L}(Y) = \mathcal{L}(X)$, as claimed. ∎


Remark 11.4.4. Proving weak convergence by showing convergence of
moments is called the method of moments*. Indeed, it is possible to prove
the central limit theorem in this manner, under appropriate assumptions.
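
As a small illustration of the method of moments in the simplest setting, the following Python sketch computes exact even moments of $S_n/\sqrt{n}$, where $S_n$ is a sum of $n$ independent $\pm 1$ coin flips, and compares them with the moments $1, 3, 15$ of $N(0,1)$. The choice of coin flips and the function names are assumptions made for this example; the general argument requires more work.

```python
from fractions import Fraction
from math import comb

def standardized_moment(n, k):
    """Exact k-th moment of S_n / sqrt(n), for k even, where S_n is a sum of n
    independent +/-1 coin flips (mean 0, variance 1 each)."""
    assert k % 2 == 0, "odd moments vanish by symmetry"
    # S_n = 2*H - n with H ~ Binomial(n, 1/2)
    raw = sum(Fraction(comb(n, h), 2**n) * (2 * h - n) ** k for h in range(n + 1))
    return raw / Fraction(n) ** (k // 2)

def normal_moment(k):
    """k-th moment of N(0,1) for even k: (k-1)!! = 1*3*5*...*(k-1)."""
    result = 1
    for j in range(1, k, 2):
        result *= j
    return result

for n in (10, 100, 1000):
    print(n, [float(standardized_moment(n, k)) for k in (2, 4, 6)])
print("N(0,1):", [normal_moment(k) for k in (2, 4, 6)])
```

Since $N(0,1)$ has a moment generating function which is finite everywhere, Theorem 11.4.3 shows that it is determined by its moments, so convergence of all moments would indeed give weak convergence by Theorem 11.4.1.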


Remark 11.4.5. By similar reasoning, it is possible to show that if
$M_{X_n}(s) < \infty$ and $M_X(s) < \infty$ for all $n \in \mathbf{N}$ and $|s| < s_0$, and also
$M_{X_n}(s) \to M_X(s)$ for all $|s| < s_0$, then we must have $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$.


11.5. Exercises. 


Exercise 11.5.1. Let $\mu_n = \delta_n$ be a point mass at $n$ (for $n = 1,2,\ldots$).
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure
$\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.) Relate this to
theorems from this section.
(c) Setting $F_n(x) = \mu_n((-\infty, x])$, does there exist a non-decreasing,
right-continuous function $F$ such that $F_n(x) \to F(x)$ for all continuity points $x$
of $F$? (If so, then specify $F$.) Relate this to the Helly Selection Principle.
(d) Repeat part (c) for the case where $\mu_n = \delta_{-n}$ is a point mass at $-n$.


Exercise 11.5.2. Let $\mu_n = \delta_{n \bmod 3}$ be a point mass at $n \bmod 3$. (Thus,
$\mu_1 = \delta_1$, $\mu_2 = \delta_2$, $\mu_3 = \delta_0$, $\mu_4 = \delta_1$, $\mu_5 = \delta_2$, $\mu_6 = \delta_0$, etc.)
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$? (If
so, then specify $\mu$.)
(c) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure
$\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (If so, then specify $\{n_k\}$ and $\mu$.)
(d) Relate parts (b) and (c) to theorems from this section.


*This should not be confused with the statistical estimation procedure of the same
name, which estimates unknown parameters by choosing them to make observed moments
equal theoretical ones.


Exercise 11.5.3. Let $\{x_n\}$ be any sequence of points in the interval $[0,1]$.
Let $\mu_n = \delta_{x_n}$ be a point mass at $x_n$.
(a) Is $\{\mu_n\}$ tight?
(b) Does there exist a subsequence $\{\mu_{n_k}\}$, and a Borel probability measure
$\mu$, such that $\mu_{n_k} \Rightarrow \mu$? (Hint: by compactness, there must be a subsequence
of points $\{x_{n_k}\}$ which converges, say to $y \in [0,1]$. Then what does $\mu_{n_k}$
converge to?)


Exercise 11.5.4. Let $\mu_{2n} = \delta_0$, and let $\mu_{2n+1} = \delta_n$, for $n = 0,1,2,\ldots$.
(a) Does there exist a Borel probability measure $\mu$, such that $\mu_n \Rightarrow \mu$?
(b) Suppose for some subsequence $\{\mu_{n_k}\}$ and some Borel probability
measure $\nu$, we have $\mu_{n_k} \Rightarrow \nu$. What must $\nu$ be?
(c) Relate parts (a) and (b) to Corollary 11.1.11. Why is there no
contradiction?


Exercise 11.5.5. Let $\mu_n = \mathrm{Uniform}[0,n]$, so $\mu_n([a,b]) = (b-a)/n$ for
$0 \le a \le b \le n$.
(a) Prove or disprove that $\{\mu_n\}$ is tight.
(b) Prove or disprove that there is some probability measure $\mu$ such that
$\mu_n \Rightarrow \mu$.


Exercise 11.5.6. Suppose $\mu_n \Rightarrow \mu$. Prove or disprove that $\{\mu_n\}$ must
be tight.


Exercise 11.5.7. Define the Borel probability measure $\mu_n$ by $\mu_n(\{x\}) =
1/n$, for $x = 0, \frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}$. Let $\lambda$ be Lebesgue measure on $[0,1]$.
(a) Compute $\phi_n(t) = \int e^{itx} \mu_n(dx)$, the characteristic function of $\mu_n$.
(b) Compute $\phi(t) = \int e^{itx} \lambda(dx)$, the characteristic function of $\lambda$.
(c) Does $\phi_n(t) \to \phi(t)$, for each $t \in \mathbf{R}$?
(d) What does the result in part (c) imply?


Exercise 11.5.8. Use characteristic functions to provide an alternative 
solution of Exercise 10.3.2. 


Exercise 11.5.9. Use characteristic functions to provide an alternative
solution of Exercise 10.3.3.


Exercise 11.5.10. Use characteristic functions to provide an alternative
solution of Exercise 10.3.4.


Exercise 11.5.11. Compute the characteristic function $\phi_X(t)$, and also
$\phi'_X(0) = iE(X)$, where $X$ follows
(a) the binomial distribution: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, for $k =
0,1,2,\ldots,n$.
(b) the Poisson distribution: $P(X = k) = e^{-\lambda} \frac{\lambda^k}{k!}$, for $k = 0,1,2,\ldots$.
(c) the exponential distribution, with density with respect to Lebesgue
measure given by $f_X(x) = \lambda e^{-\lambda x}$ for $x > 0$, and $f_X(x) = 0$ for $x < 0$.


Exercise 11.5.12. Suppose that for $n \in \mathbf{N}$, we have $P[X_n = 5] = 1/n$
and $P[X_n = 6] = 1 - (1/n)$.
(a) Compute the characteristic function $\phi_{X_n}(t)$, for all $n \in \mathbf{N}$ and $t \in \mathbf{R}$.
(b) Compute $\lim_{n\to\infty} \phi_{X_n}(t)$.
(c) Specify a distribution $\mu$ such that $\lim_{n\to\infty} \phi_{X_n}(t) = \int e^{itx} \mu(dx)$ for all
$t \in \mathbf{R}$.
(d) Determine (with explanation) whether or not $\mathcal{L}(X_n) \Rightarrow \mu$.


Exercise 11.5.13. Let $\{X_n\}$ be i.i.d., each having mean 3 and variance
4. Let $S = X_1 + X_2 + \ldots + X_{10,000}$. In terms of $\Phi(x)$, give an approximate
value for $P[S \le 30{,}500]$.


Exercise 11.5.14. Let $X_1, X_2, \ldots$ be i.i.d. with mean 4 and variance
9. Find values $C(n,x)$, for $n \in \mathbf{N}$ and $x \in \mathbf{R}$, such that as $n \to \infty$,


Exercise 11.5.15. Prove that the Poisson($\lambda$) distribution, and the
$N(m,v)$ (normal) distribution, are both infinitely divisible (for any $\lambda > 0$,
$m \in \mathbf{R}$, and $v > 0$). [Hint: Use Exercises 9.5.15 and 9.5.16.]


Exercise 11.5.16. Let $X$ be a random variable whose distribution $\mathcal{L}(X)$
is infinitely divisible. Let $a > 0$ and $b \in \mathbf{R}$, and set $Y = aX + b$. Prove that
$\mathcal{L}(Y)$ is infinitely divisible.


Exercise 11.5.17. Prove that the Poisson($\lambda$) distribution, the $N(m,v)$
distribution, and the Exp($\lambda$) (exponential) distribution, are all determined
by their moments, for any $\lambda > 0$, $m \in \mathbf{R}$, and $v > 0$.


Exercise 11.5.18. Let $X, X_1, X_2, \ldots$ be random variables which are
uniformly bounded, i.e. there is $M \in \mathbf{R}$ with $|X| \le M$ and $|X_n| \le M$ for
all $n$. Prove that $\mathcal{L}(X_n) \Rightarrow \mathcal{L}(X)$ if and only if $E(X_n^k) \to E(X^k)$ for all
$k \in \mathbf{N}$.


11.6. Section summary.


This section introduced the characteristic function $\phi_X(t) = E(e^{itX})$.
After introducing its basic properties, it proved an Inversion Theorem (to 
recover the distribution of a random variable from its characteristic func- 
tion) and a Uniqueness Theorem (which shows that if two random variables 
have the same characteristic function then they have the same distribution). 

Then, using the Helly Selection Principle and the notion of tightness, it 
proved the important Continuity Theorem, which asserts the equivalence of 
weak convergence of distributions and pointwise convergence of characteris- 
tic functions. This important theorem was used to prove the Central Limit 
Theorem about weak convergence of averages of i.i.d. random variables to 
a normal distribution. Some generalisations of the Central Limit Theorem 
were briefly discussed. 

The section ended with a discussion of the method of moments, an
alternative way of proving weak convergence of random variables using only
their moments, but not their characteristic functions.


12. Decomposition of probability laws.


Let $\mu$ be a Borel probability measure on $\mathbf{R}$. Recall that $\mu$ is discrete if it
takes all its mass at individual points, i.e. if $\sum_{x \in \mathbf{R}} \mu\{x\} = \mu(\mathbf{R})$. Also $\mu$ is
absolutely continuous if there is a non-negative Borel-measurable function
$f$ such that $\mu(A) = \int_A f(x)\,\lambda(dx) = \int 1_A(x) f(x)\,\lambda(dx)$ for all Borel sets $A$,
where $\lambda$ is Lebesgue measure on $\mathbf{R}$.

One could ask, are these the only possible types of probability
distributions? Of course the answer to this question is no, as $\mu$ could be a mixture of
a discrete and an absolutely continuous distribution, e.g. $\mu = \frac12 \delta_0 + \frac12 N(0,1)$.
But can every probability distribution at least be written as such a mixture?
Equivalently, is it true that every distribution $\mu$ with no discrete component
(i.e. which satisfies $\mu\{x\} = 0$ for each $x \in \mathbf{R}$) must necessarily be absolutely
continuous?

To examine this question, say that $\mu$ is dominated by $\lambda$ (written $\mu \ll \lambda$),
if $\mu(A) = 0$ whenever $\lambda(A) = 0$. Then clearly any absolutely continuous
measure $\mu$ must be dominated by $\lambda$, since whenever $\lambda(A) = 0$ we would
then have $\mu(A) = \int 1_A(x) f(x)\,\lambda(dx) = 0$. (We shall see in Corollary 12.1.2
below that in fact the converse to this statement also holds.)

On the other hand, suppose $Z_1, Z_2, \ldots$ are i.i.d. taking the value 1 with
probability 2/3, and the value 0 with probability 1/3. Set $Y = \sum_{n=1}^{\infty} Z_n 2^{-n}$
(i.e. the base-2 expansion of $Y$ is $0.Z_1 Z_2 \ldots$). Further, define $S \subseteq \mathbf{R}$ by

$$S = \left\{ x \in [0,1] \,;\ \lim_{n\to\infty} \frac{1}{n} \sum_{i=1}^{n} d_i(x) = \frac{2}{3} \right\},$$

where $d_i(x)$ is the $i$-th digit in the (non-terminating) base-2 expansion of $x$.
Then by the strong law of large numbers, we have $P(Y \in S) = 1$ while
$\lambda(S) = 0$. Hence, from the previous paragraph, the law of $Y$ cannot be
absolutely continuous. But clearly $\mathcal{L}(Y)$ has no discrete component. We
conclude that $\mathcal{L}(Y)$ cannot be written as a mixture of a discrete and an
absolutely continuous distribution. In fact, $\mathcal{L}(Y)$ is singular with respect to
$\lambda$ (written $\mathcal{L}(Y) \perp \lambda$), meaning that there is a subset $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$
and $P(Y \in S^C) = 0$.
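
The separation between $\mathcal{L}(Y)$ and $\lambda$ can already be seen at a finite number of digits. The Python sketch below (illustrative only; the cut-off of 60% ones among the first 200 digits is an arbitrary choice) uses the standard fact that under Lebesgue measure on $[0,1]$ the binary digits are i.i.d. fair coin flips, while under $\mathcal{L}(Y)$ each digit equals 1 with probability 2/3.

```python
from fractions import Fraction
from math import comb

def prob_many_ones(m, p, threshold):
    """P(at least `threshold` of the first m binary digits equal 1), when the
    digits are independent and each equals 1 with probability p."""
    return sum(comb(m, j) * p**j * (1 - p) ** (m - j)
               for j in range(threshold, m + 1))

m, threshold = 200, 120   # event: at least 60% ones among the first 200 digits
under_law_of_Y = prob_many_ones(m, Fraction(2, 3), threshold)
under_lebesgue = prob_many_ones(m, Fraction(1, 2), threshold)
print(float(under_law_of_Y), float(under_lebesgue))
```

The same event has probability close to 1 under $\mathcal{L}(Y)$ and close to 0 under $\lambda$; letting the number of digits grow is what produces a set $S$ with $P(Y \in S) = 1$ and $\lambda(S) = 0$.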


12.1. Lebesgue and Hahn decompositions.


The main result of this section is the following.


Theorem 12.1.1. (Lebesgue Decomposition) Any probability measure
$\mu$ on $\mathbf{R}$ can uniquely be decomposed as $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$, where the
measures $\mu_{disc}$, $\mu_{ac}$, and $\mu_{sing}$ satisfy

(a) $\mu_{disc}$ is discrete, i.e. $\sum_{x \in \mathbf{R}} \mu_{disc}\{x\} = \mu_{disc}(\mathbf{R})$;

(b) $\mu_{ac}$ is absolutely continuous, i.e. $\mu_{ac}(A) = \int_A f\,d\lambda$ for all Borel sets $A$,
for some non-negative, Borel-measurable function $f$, where $\lambda$ is Lebesgue
measure on $\mathbf{R}$;

(c) $\mu_{sing}$ is singular continuous, i.e. $\mu_{sing}\{x\} = 0$ for all $x \in \mathbf{R}$, but there
is $S \subseteq \mathbf{R}$ with $\lambda(S) = 0$ and $\mu_{sing}(S^C) = 0$.


Theorem 12.1.1 will be proved below. We first present an important 
corollary. 


Corollary 12.1.2. (Radon-Nikodym Theorem) A Borel probability
measure $\mu$ is absolutely continuous (i.e. there is $f$ with $\mu(A) = \int_A f\,d\lambda$ for
all Borel $A$) if and only if it is dominated by $\lambda$ (i.e. $\mu \ll \lambda$, i.e. $\mu(A) = 0$
whenever $\lambda(A) = 0$).


Proof. We have already seen that if $\mu$ is absolutely continuous then
$\mu \ll \lambda$.

For the converse, suppose $\mu \ll \lambda$, and let $\mu = \mu_{disc} + \mu_{ac} + \mu_{sing}$ as in
Theorem 12.1.1. Since $\lambda\{x\} = 0$ for each $x$, we must have $\mu\{x\} = 0$ as well,
so that $\mu_{disc} = 0$. Similarly, if $S$ is such that $\lambda(S) = 0$ and $\mu_{sing}(S^C) = 0$,
then we must have $\mu(S) = 0$, so that $\mu_{sing}(S) = 0$, so that $\mu_{sing} = 0$.
Hence, $\mu = \mu_{ac}$, i.e. $\mu$ is absolutely continuous. ∎


Remark 12.1.3. If $\mu(A) = \int_A f\,d\lambda$ for all Borel $A$, we write $\frac{d\mu}{d\lambda} = f$, and
call $f$ the density, or Radon-Nikodym derivative, of $\mu$ with respect to $\lambda$.


It remains to prove Theorem 12.1.1. We begin with a lemma. 


Lemma 12.1.4. (Hahn Decomposition) Let $\phi$ be a finite "signed
measure" on $(\Omega, \mathcal{F})$, i.e. $\phi = \mu - \nu$ for some finite measures $\mu$ and $\nu$. Then there
is a partition $\Omega = A^+ \,\dot{\cup}\, A^-$, with $A^+, A^- \in \mathcal{F}$, such that $\phi(E) \ge 0$ for all
$E \subseteq A^+$, and $\phi(E) \le 0$ for all $E \subseteq A^-$.


Proof. Following Billingsley (1995, Theorem 32.1), we set

$$\alpha = \sup\{\phi(A);\ A \in \mathcal{F}\}.$$

We shall construct a subset $A^+$ such that $\phi(A^+) = \alpha$. Once we have
done this, we can take $A^- = \Omega \setminus A^+$. Then if $E \subseteq A^+$ but
$\phi(E) < 0$, then $\phi(A^+ \setminus E) = \phi(A^+) - \phi(E) > \phi(A^+) = \alpha$, which contradicts
the definition of $\alpha$. Similarly, if $E \subseteq A^-$ but $\phi(E) > 0$, then
$\phi(A^+ \cup E) = \phi(A^+) + \phi(E) > \alpha$, again contradicting the definition of $\alpha$. Thus, to
complete the proof, it suffices to construct $A^+$ with $\phi(A^+) = \alpha$.

To that end, choose (by the definition of $\alpha$) subsets $A_1, A_2, \ldots \in \mathcal{F}$ such
that $\phi(A_n) \to \alpha$. Let $A = \bigcup_i A_i$, and let

$$\mathcal{G}_n = \left\{ \bigcap_{k=1}^{n} A'_k \,;\ \text{each } A'_k = A_k \text{ or } A \setminus A_k \right\}$$

(so that $\mathcal{G}_n$ contains $\le 2^n$ different subsets, which are all disjoint). Then
let

$$C_n = \bigcup_{S \in \mathcal{G}_n,\ \phi(S) \ge 0} S,$$

i.e. $C_n$ is the union of those elements of $\mathcal{G}_n$ with non-negative measure under
$\phi$. Finally, set $A^+ = \limsup C_n$. We claim that $\phi(A^+) = \alpha$.

First, note that since $A_n$ is a union of certain particular elements of
$\mathcal{G}_n$ (namely, all those formed with $A'_n = A_n$), and $C_n$ is a union of all the
$\phi$-positive elements of $\mathcal{G}_n$, it follows that $\phi(C_n) \ge \phi(A_n)$.

Next, note that $C_m \cup \ldots \cup C_n$ is formed from $C_m \cup \ldots \cup C_{n-1}$ by including
some additional $\phi$-positive elements of $\mathcal{G}_n$, so $\phi(C_m \cup C_{m+1} \cup \ldots \cup C_n) \ge
\phi(C_m \cup C_{m+1} \cup \ldots \cup C_{n-1})$. It follows by induction that $\phi(C_m \cup \ldots \cup C_n) \ge
\phi(C_m) \ge \phi(A_m)$. Since this holds for all $n$, we must have (by continuity of
probabilities on $\mu$ and $\nu$ separately) that $\phi(C_m \cup C_{m+1} \cup \ldots) \ge \phi(A_m)$ for
all $m$.

But then (again by continuity of probabilities on $\mu$ and $\nu$ separately) we
have

$$\phi(A^+) = \phi(\limsup C_n) = \lim_{m\to\infty} \phi(C_m \cup C_{m+1} \cup \ldots) \ge \lim_{m\to\infty} \phi(A_m) = \alpha.$$

Hence, $\phi(A^+) = \alpha$, as claimed. ∎
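
On a finite sample space the construction collapses to a brute-force search, which can help build intuition for the role of $\alpha$. The following Python sketch is only an illustration: the four-point space and the measures `mu` and `nu` are made-up values, and the search over all subsets stands in for the limiting construction used in the proof.

```python
from fractions import Fraction
from itertools import combinations

# Two finite measures on a four-point sample space (hypothetical values).
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4), "d": Fraction(0)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(3, 8), "d": Fraction(1, 8)}

def phi(event):
    """Signed measure phi = mu - nu of a subset of the sample space."""
    return sum(mu[w] - nu[w] for w in event)

def subsets(points):
    """All subsets of the given collection, in a deterministic order."""
    points = sorted(points)
    for r in range(len(points) + 1):
        for combo in combinations(points, r):
            yield set(combo)

omega = sorted(mu)
alpha = max(phi(A) for A in subsets(omega))   # sup of phi over all events
A_plus = max(subsets(omega), key=phi)         # an event attaining alpha
A_minus = set(omega) - A_plus

print(sorted(A_plus), sorted(A_minus), alpha)
```

As in the proof, any event attaining $\alpha$ works as $A^+$: every subset of it has $\phi \ge 0$, and every subset of its complement has $\phi \le 0$.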


Remarks.

1. The Hahn decomposition is unique up to sets of $\phi$-measure 0, i.e. if
$A^+ \,\dot{\cup}\, A^-$ and $B^+ \,\dot{\cup}\, B^-$ are two Hahn decompositions then $\phi(A^+ \setminus B^+) =
\phi(B^+ \setminus A^+) = \phi(A^- \setminus B^-) = \phi(B^- \setminus A^-) = 0$, since e.g. $A^+ \setminus B^+ =
A^+ \cap B^-$ must have $\phi$-measure both $\ge 0$ and $\le 0$.

2. Using the Radon-Nikodym theorem, it is very easy to prove the Hahn
decomposition; indeed, we can let $\xi = \mu + \nu$, let $f = \frac{d\mu}{d\xi}$ and $g = \frac{d\nu}{d\xi}$,
and set $A^+ = \{f \ge g\}$. However, this reasoning would be circular,
since we are going to use the Hahn decomposition to prove the Lebesgue
decomposition which in turn proves the Radon-Nikodym theorem!

3. If $\phi$ is any countably additive mapping from $\mathcal{F}$ to $\mathbf{R}$ (not necessarily
non-negative, nor assumed to be of the form $\mu - \nu$), then continuity
of probabilities follows just as in Proposition 3.3.1, and the proof of
Lemma 12.1.4 then goes through without change. Furthermore, it then
follows that $\phi = \mu - \nu$ where $\mu(E) = \phi(E \cap A^+)$ and $\nu(E) = -\phi(E \cap A^-)$,
i.e. every such function is in fact a signed measure.


We are now able to prove our main theorem. 


Proof of Theorem 12.1.1. We first take care of $\mu_{disc}$. Indeed, clearly
we shall define $\mu_{disc}(A) = \sum_{x \in A} \mu\{x\}$, and then $\mu - \mu_{disc}$ has no discrete
component. Hence, we assume from now on that $\mu$ has no discrete
component.

Second, we note that by countable additivity, it suffices to assume that
$\mu$ is supported on $[0,1]$, so that we may also take $\lambda$ to be Lebesgue measure
on $[0,1]$.

To continue, call a function $g$ a candidate density if $g \ge 0$ and $\int_E g\,d\lambda \le
\mu(E)$ for all Borel sets $E$. We note that if $g_1$ and $g_2$ are candidate densities,
then so is $\max(g_1, g_2)$, since

$$\begin{aligned}
\int_E \max(g_1, g_2)\,d\lambda &= \int_{E \cap \{g_1 \ge g_2\}} g_1\,d\lambda + \int_{E \cap \{g_1 < g_2\}} g_2\,d\lambda \\
&\le \mu(E \cap \{g_1 \ge g_2\}) + \mu(E \cap \{g_1 < g_2\}) = \mu(E).
\end{aligned}$$

Also, by the monotone convergence theorem, if $h_1, h_2, \ldots$ are candidate
densities and $h_n \nearrow h$, then $h$ is also a candidate density. It follows from
these two observations that if $g_1, g_2, \ldots$ are candidate densities, then so is
$\sup_n g_n = \lim_{n\to\infty} \max(g_1, \ldots, g_n)$.

Now, let $\beta = \sup\{\int_{[0,1]} g\,d\lambda;\ g \text{ a candidate density}\}$. Choose candidate
densities $g_n$ with $\int_{[0,1]} g_n\,d\lambda \ge \beta - \frac{1}{n}$, and let $f = \sup_{n \ge 1} g_n$, to obtain that
$f$ is a candidate density with $\int_{[0,1]} f\,d\lambda = \beta$, i.e. $f$ is (up to a set of measure
0) the largest possible candidate density.

This $f$ shall be our density for $\mu_{ac}$. That is, we define $\mu_{ac}(A) = \int_A f\,d\lambda$.
We then (of course) define $\mu_{sing}(A) = \mu(A) - \mu_{ac}(A)$. Since $f$ was a
candidate density, therefore $\mu_{sing}(A) \ge 0$. To complete the existence proof, it
suffices to show that $\mu_{sing}$ is singular.

For each $n \in \mathbf{N}$, let $[0,1] = A_n^+ \,\dot{\cup}\, A_n^-$ be a Hahn decomposition (cf.
Lemma 12.1.4) for the signed measure $\phi_n = \mu_{sing} - \frac{1}{n}\lambda$. Set $M = \bigcup_n A_n^+$.
Then $M^C = \bigcap_n A_n^-$, so that $M^C \subseteq A_n^-$ for each $n$. It follows that
$(\mu_{sing} - \frac{1}{n}\lambda)(M^C) \le 0$ for all $n$, so that $\mu_{sing}(M^C) \le \frac{1}{n}\lambda(M^C)$ for all
$n$. Hence, $\mu_{sing}(M^C) = 0$. We claim that $\lambda(M) = 0$, so that $\mu_{sing}$ is
indeed singular. To prove this, we assume that $\lambda(M) > 0$, and derive a
contradiction.

If $\lambda(M) > 0$, then there is $n \in \mathbf{N}$ with $\lambda(A_n^+) > 0$. For this $n$, we have
$(\mu_{sing} - \frac{1}{n}\lambda)(E) \ge 0$, i.e. $\mu_{sing}(E) \ge \frac{1}{n}\lambda(E)$, for all $E \subseteq A_n^+$. We now
claim that the function $g = f + \frac{1}{n} 1_{A_n^+}$ is a candidate density. Indeed, we
compute for any Borel set $E$ that

$$\begin{aligned}
\int_E g\,d\lambda &= \int_E f\,d\lambda + \frac{1}{n} \int_E 1_{A_n^+}\,d\lambda \\
&= \mu_{ac}(E) + \frac{1}{n}\lambda(A_n^+ \cap E) \\
&\le \mu_{ac}(E) + \mu_{sing}(A_n^+ \cap E) \\
&\le \mu_{ac}(E) + \mu_{sing}(E) \\
&= \mu(E),
\end{aligned}$$

thus verifying the claim.

On the other hand, we have

$$\int_{[0,1]} g\,d\lambda = \int_{[0,1]} f\,d\lambda + \frac{1}{n} \int_{[0,1]} 1_{A_n^+}\,d\lambda = \beta + \frac{1}{n}\lambda(A_n^+) > \beta,$$

which contradicts the maximality of $f$. Hence, we must actually have had
$\lambda(M) = 0$, showing that $\mu_{sing}$ must actually be singular, thus proving the
existence part of the theorem.

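Finally, we prove the uniqueness. Indeed, suppose $\mu = \mu_{ac} + \mu_{sing} =
\nu_{ac} + \nu_{sing}$, with $\mu_{ac}(A) = \int_A f\,d\lambda$ and $\nu_{ac}(A) = \int_A g\,d\lambda$. Since $\mu_{sing}$ and
$\nu_{sing}$ are singular, we can find $S_1$ and $S_2$ with $\lambda(S_1) = \lambda(S_2) = 0$ and
$\mu_{sing}(S_1^C) = \nu_{sing}(S_2^C) = 0$. Let $S = S_1 \cup S_2$, and let $B = \{\omega \in S^C;\ f(\omega) <
g(\omega)\}$. Then $g - f > 0$ on $B$, but, since $\mu_{sing}(B) = \nu_{sing}(B) = 0$, we have
$\int_B (g - f)\,d\lambda = \nu_{ac}(B) - \mu_{ac}(B) = \mu(B) - \mu(B) = 0$. Hence, $\lambda(B) = 0$. But we
also have $\lambda(S) = 0$, hence $\lambda\{f < g\} = 0$. Similarly $\lambda\{f > g\} = 0$. We conclude
that $\lambda\{f = g\} = 1$, whence $\mu_{ac} = \nu_{ac}$, whence $\mu_{sing} = \nu_{sing}$. ∎


Remark. Note that, while $\mu_{ac}$ is unique, the density $f = \frac{d\mu_{ac}}{d\lambda}$ is only
unique up to a set of measure 0.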


12.2. Decomposition with general measures. 


Finally, we note that similar decompositions may be made with respect
to other measures. Instead of considering absolute continuity or singularity
with respect to $\lambda$, we can consider them with respect to any other
probability measure $\nu$, and the same proofs apply. Furthermore, by countable
additivity, the above proofs go through virtually unchanged for $\sigma$-finite $\nu$ as
well (cf. Remark 4.4.3). We state the more general results as follows. (For
the general Radon-Nikodym theorem, we write $\mu \ll \nu$, and say that $\mu$ is
dominated by $\nu$, if $\mu(A) = 0$ whenever $\nu(A) = 0$.)


Theorem 12.2.1. (Lebesgue Decomposition, general case) If $\mu$ and $\nu$ are
two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ can uniquely
be decomposed as $\mu = \mu_{ac} + \mu_{sing}$, where $\mu_{ac}$ is absolutely continuous with
respect to $\nu$ (i.e., there is a non-negative measurable function $f : \Omega \to \mathbf{R}$
such that $\mu_{ac}(A) = \int_A f\,d\nu$ for all $A \in \mathcal{F}$), and $\mu_{sing}$ is singular (i.e., there
is $S \subseteq \Omega$ with $\nu(S) = 0$ and $\mu_{sing}(S^C) = 0$).


Remark. In the above statement, for simplicity we avoid mentioning
the discrete part $\mu_{disc}$, thus allowing for the possibility that $\nu$ may itself
have some discrete component (in which case the corresponding discrete
component of $\mu$ would still be absolutely continuous with respect to $\nu$).
However, if $\nu$ has no discrete component, then if we wish we can extract
$\mu_{disc}$ from $\mu_{sing}$ as in Theorem 12.1.1.
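
When $\nu$ is itself discrete, the general decomposition can be written down by hand, which makes a convenient sanity check. The Python sketch below works on a three-point space with made-up measures `mu` and `nu` (both hypothetical); the formulas $f = \mu\{\omega\}/\nu\{\omega\}$ where $\nu\{\omega\} > 0$ and $S = \{\omega : \nu\{\omega\} = 0\}$ are specific to this finite setting.

```python
from fractions import Fraction

# Two measures on a three-point space (hypothetical values); nu gives no mass to "c".
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
nu = {"a": Fraction(1, 3), "b": Fraction(2, 3), "c": Fraction(0)}

S = {w for w in nu if nu[w] == 0}   # nu-null set which carries mu_sing
# density of mu_ac with respect to nu (its value on the nu-null set is irrelevant)
f = {w: (mu[w] / nu[w] if nu[w] > 0 else Fraction(0)) for w in mu}

def mu_ac(A):
    """Absolutely continuous part: integral of f over A with respect to nu."""
    return sum(f[w] * nu[w] for w in A)

def mu_sing(A):
    """Singular part: the mass that mu places on A inside the nu-null set S."""
    return sum(mu[w] for w in A if w in S)

A = {"a", "c"}
print(mu_ac(A), mu_sing(A), sum(mu[w] for w in A))
```

Here the atoms of $\mu$ at points charged by $\nu$ end up in $\mu_{ac}$, which is exactly the situation described in the preceding Remark.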


Corollary 12.2.2. (Radon-Nikodym Theorem, general case) If $\mu$ and
$\nu$ are two $\sigma$-finite measures on some measurable space $(\Omega, \mathcal{F})$, then $\mu$ is
absolutely continuous with respect to $\nu$ if and only if $\mu \ll \nu$.


If $\mu \ll \nu$, so $\mu(A) = \int_A f\,d\nu$ for all $A \in \mathcal{F}$, then we write $\frac{d\mu}{d\nu} = f$,
and call $\frac{d\mu}{d\nu}$ the Radon-Nikodym derivative of $\mu$ with respect to $\nu$. Thus,
$\mu(A) = \int_A \frac{d\mu}{d\nu}\,d\nu$. If $\nu$ is a probability measure, then we can also write this
as $\mu(A) = E_\nu\left[\frac{d\mu}{d\nu} 1_A\right]$, where $E_\nu$ stands for expectation with respect to $\nu$.


12.3. Exercises. 


Exercise 12.3.1. Prove that $\mu$ is discrete if and only if there is a countable
set $S$ with $\mu(S^C) = 0$.


Exercise 12.3.2. Let $X$ and $Y$ be discrete random variables (not
necessarily independent), and let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is discrete.


Exercise 12.3.3. Let $X$ be a random variable, and let $Y = cX$ for some
constant $c > 0$.
(a) Prove or disprove that if $\mathcal{L}(X)$ is discrete, then $\mathcal{L}(Y)$ must be discrete.
(b) Prove or disprove that if $\mathcal{L}(X)$ is absolutely continuous, then $\mathcal{L}(Y)$
must be absolutely continuous.
(c) Prove or disprove that if $\mathcal{L}(X)$ is singular continuous, then $\mathcal{L}(Y)$ must
be singular continuous.


Exercise 12.3.4. Let $X$ and $Y$ be random variables, with $\mathcal{L}(Y)$ absolutely
continuous, and let $Z = X + Y$.
(a) Assume $X$ and $Y$ are independent. Prove that $\mathcal{L}(Z)$ is absolutely
continuous, regardless of the nature of $\mathcal{L}(X)$. [Hint: Recall the convolution
formula of Subsection 9.4.]
(b) Show that if $X$ and $Y$ are not independent, then $\mathcal{L}(Z)$ may fail to be
absolutely continuous.


Exercise 12.3.5. Let $X$ and $Y$ be random variables, with $\mathcal{L}(X)$ discrete
and $\mathcal{L}(Y)$ singular continuous. Let $Z = X + Y$. Prove that $\mathcal{L}(Z)$ is singular
continuous. [Hint: If $\lambda(S) = P(Y \in S^C) = 0$, consider the set $U = \{s + x :
s \in S,\ P(X = x) > 0\}$.]


Exercise 12.3.6. Let $A, B, Z_1, Z_2, \ldots$ be i.i.d., each equal to $+1$ with
probability 2/3, or equal to 0 with probability 1/3. Let $Y = \sum_{i=1}^{\infty} Z_i 2^{-i}$ as
at the beginning of this section (so $\nu = \mathcal{L}(Y)$ is singular continuous), and
let $W \sim N(0,1)$. Finally, let $X = A(BY + (1-B)W)$, and set $\mu = \mathcal{L}(X)$.
Find a discrete measure $\mu_{disc}$, an absolutely continuous measure $\mu_{ac}$, and a
singular continuous measure $\mu_s$, such that $\mu = \mu_{disc} + \mu_{ac} + \mu_s$.


Exercise 12.3.7. Let $\mu$, $\nu$, and $\rho$ be probability measures with $\mu \ll
\nu \ll \rho$. Prove that $\frac{d\mu}{d\rho} = \frac{d\mu}{d\nu} \frac{d\nu}{d\rho}$ with $\rho$-probability 1. [Hint: Use Proposition
6.2.3.]


Exercise 12.3.8. Let $\mu$ and $\nu$ be probability measures with $\mu \ll \nu$ and
$\nu \ll \mu$. (This is sometimes written as $\mu \equiv \nu$.) Prove that $\frac{d\mu}{d\nu} > 0$ with
$\mu$-probability 1, and in fact $\frac{d\nu}{d\mu} = 1 / \frac{d\mu}{d\nu}$.


Exercise 12.3.9. Let $\mu$ and $\nu$ be discrete probability measures, with
$\phi = \mu - \nu$. Write down an explicit Hahn decomposition $\Omega = A^+ \,\dot{\cup}\, A^-$ for
$\phi$.


Exercise 12.3.10. Let $\mu$ and $\nu$ be absolutely continuous probability
measures on $\mathbf{R}$ (with the Borel $\sigma$-algebra), with $\phi = \mu - \nu$. Write down an
explicit Hahn decomposition $\mathbf{R} = A^+ \,\dot{\cup}\, A^-$ for $\phi$.


12.4. Section summary. 


This section proved the Lebesgue Decomposition theorem, which says
that every measure may be uniquely written as the sum of a discrete
measure, an absolutely continuous measure, and a singular continuous measure.
From this followed the Radon-Nikodym Theorem, which gives a simple
condition under which a measure is absolutely continuous.


13. Conditional probability and expectation.


Conditioning is a very important concept in probability, and we consider
it here.

Of course, conditioning on events of positive measure is quite
straightforward. We have already noted that if $A$ and $B$ are events, with $P(B) > 0$,
then we can define the conditional probability $P(A \mid B) = P(A \cap B) / P(B)$;
intuitively, this represents the probabilistic proportion of the event $B$ which
also includes the event $A$. More generally, if $Y$ is a random variable, and
if we define $\nu$ by $\nu(S) = P(Y \in S \mid B) = P(Y \in S,\ B) / P(B)$, then
$\nu = \mathcal{L}(Y \mid B)$ is a probability measure, called the conditional distribution
of $Y$ given $B$. We can then define conditional expectation by $E(Y \mid B) =
\int y\,\nu(dy)$. Also, $\mathcal{L}(Y 1_B) = P(B)\,\mathcal{L}(Y \mid B) + P(B^C)\,\delta_0$, so taking
expectations and re-arranging,

$$E(Y \mid B) = E(Y 1_B) / P(B). \qquad (13.0.1)$$

No serious difficulties arise.
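
For a concrete instance of (13.0.1), the Python sketch below takes a fair die (an example chosen here, not taken from the text), lets $Y$ be the value shown and $B$ the event that the roll is even, and computes $E(Y \mid B)$ both from the conditional distribution $\nu$ and from the formula $E(Y 1_B)/P(B)$.

```python
from fractions import Fraction

omega = range(1, 7)                      # outcomes of a fair die
P = {w: Fraction(1, 6) for w in omega}
B = {2, 4, 6}                            # event "the roll is even"

def Y(w):
    """The random variable Y: the value shown on the die."""
    return w

P_B = sum(P[w] for w in B)

# Conditional distribution nu{y} = P(Y = y, B) / P(B)
nu = {}
for w in B:
    nu[Y(w)] = nu.get(Y(w), 0) + P[w] / P_B

via_distribution = sum(y * q for y, q in nu.items())    # integral of y nu(dy)
via_formula = sum(Y(w) * P[w] for w in B) / P_B         # E(Y 1_B) / P(B)
print(via_distribution, via_formula)
```

Both computations give the same value, as (13.0.1) asserts.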

On the other hand, if P(B) = 0 then this approach does not work at 
all. Indeed, it is quite unclear how to define something like P(Y € S| B) in 
that case. Unfortunately, it frequently arises that we wish to condition on 
events of probability 0. 


13.1. Conditioning on a random variable.


We begin with an example.


Example 13.1.1. Let $(X,Y)$ be uniformly distributed on the triangle
$T = \{(x,y) \in \mathbf{R}^2;\ 0 \le y \le 2,\ y \le x \le 2\}$; see Figure 13.1.2. (That
is, $P((X,Y) \in S) = \frac12 \lambda_2(S \cap T)$ for Borel $S \subseteq \mathbf{R}^2$, where $\lambda_2$ is Lebesgue
measure on $\mathbf{R}^2$; briefly, $dP = \frac12 1_T\,dx\,dy$.) Then what is $P(Y > \frac34 \mid X = 1)$?
What is $E(Y \mid X = 1)$? Since $P(X = 1) = 0$, it is not clear how to proceed.
We shall return to this example below.


Because of this problem, we take a different approach. Given a random 
variable X, we shall consider conditional probabilities like P(A|X), and 
also conditional expected values like E(Y |X), to themselves be random 
variables. We shall think of them as functions of the “random” value X. 
This is very counter-intuitive: we are used to thinking of P(---) and E(---) 
as numbers, not random variables. However, we shall think of them as 
random variables, and we shall see that this allows us to partially resolve
the difficulty of conditioning on sets of measure 0 (such as $\{X = 1\}$ above).


Figure 13.1.2. The triangle $T$ in Example 13.1.1.


The idea is that, once we define these quantities to be random variables,
then we can demand that they satisfy certain properties. For starters, we
require that

$$E[P(A \mid X)] = P(A), \qquad E[E(Y \mid X)] = E(Y). \qquad (13.1.3)$$

In words, these random variables must have the correct expected values.
Unfortunately, this does not completely specify the distributions of the
random variables $P(A \mid X)$ and $E(Y \mid X)$; indeed, there are infinitely many
different distributions having the same mean. We shall therefore impose a
stronger requirement. To state it, recall that if $\mathcal{G}$ is a sub-$\sigma$-algebra (i.e. a
$\sigma$-algebra contained in the main $\sigma$-algebra $\mathcal{F}$), then a random variable $Z$ is
$\mathcal{G}$-measurable if $\{Z \le z\} \in \mathcal{G}$ for all $z \in \mathbf{R}$. (It follows that also $\{Z = z\} =
\{Z \le z\} \setminus \bigcup_n \{Z \le z - \frac{1}{n}\} \in \mathcal{G}$.) Also, $\sigma(X) = \{\{X \in B\} : B \subseteq \mathbf{R} \text{ Borel}\}$.


Definition 13.1.4. Given random variables $X$ and $Y$ with $E|Y| < \infty$,
and an event $A$, $P(A \mid X)$ is a conditional probability of $A$ given $X$ if it is a
$\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have

$$E\left(P(A \mid X)\, 1_{X \in S}\right) = P(A \cap \{X \in S\}). \qquad (13.1.5)$$

Similarly, $E(Y \mid X)$ is a conditional expectation of $Y$ given $X$ if it is a
$\sigma(X)$-measurable random variable and, for any Borel $S \subseteq \mathbf{R}$, we have

$$E\left(E(Y \mid X)\, 1_{X \in S}\right) = E(Y 1_{X \in S}). \qquad (13.1.6)$$

That is, we define conditional probabilities and expectations by
specifying that certain expected values of them should assume certain values. This
prompts some observations.


Remarks.

1. Requiring these quantities to be $\sigma(X)$-measurable means that they are
functions of $X$ alone, and do not otherwise depend on the sample point
$\omega$. Indeed, if a random variable $Z$ is $\sigma(X)$-measurable, then for each
$z \in \mathbf{R}$, $\{Z = z\} = \{X \in B_z\}$ for some Borel $B_z \subseteq \mathbf{R}$. Then $Z = f(X)$
where $f$ is defined by $f(x) = z$ for all $x \in B_z$, i.e. $Z$ is a function of $X$.

2. Of course, if we set $S = \mathbf{R}$ in (13.1.5) and (13.1.6), we obtain the special
case (13.1.3).

3. Since expected values are unaffected by changes on a set of measure 0,
we see that conditional probabilities and expectations are only unique
up to a set of measure 0. Thus, if $P(X = x) = 0$ for some particular
value of $x$, then we may change $P(A \mid X)$ on the set $\{X = x\}$ without
restriction. However, we may only change its value on a set of measure
0; thus, for "most" values of $X$, the value $P(A \mid X)$ cannot change. In
this sense, we have mostly (but not entirely) overcome the difficulty of
Example 13.1.1.
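
When $X$ takes only finitely many values, Definition 13.1.4 can be checked exhaustively. The following Python sketch uses a small made-up joint distribution for $(X,Y)$ (the table `pmf` is hypothetical) and verifies (13.1.6) for every subset $S$ of the values of $X$, taking as candidate the function $g(X)$, where $g(x)$ is the average of $Y$ on $\{X = x\}$.

```python
from fractions import Fraction
from itertools import combinations

# Joint pmf of (X, Y) on a small grid (hypothetical numbers, summing to 1).
pmf = {(0, 1): Fraction(1, 8), (0, 3): Fraction(1, 8),
       (1, 0): Fraction(1, 4), (1, 4): Fraction(1, 4),
       (2, 5): Fraction(1, 4)}

x_values = sorted({x for x, _ in pmf})

def g(x):
    """Candidate for E(Y | X) on the event {X = x}: the average of Y there."""
    p_x = sum(p for (a, _), p in pmf.items() if a == x)
    return sum(y * p for (a, y), p in pmf.items() if a == x) / p_x

def lhs(S):
    """E( g(X) 1_{X in S} ), the left side of (13.1.6)."""
    return sum(g(x) * p for (x, _), p in pmf.items() if x in S)

def rhs(S):
    """E( Y 1_{X in S} ), the right side of (13.1.6)."""
    return sum(y * p for (x, y), p in pmf.items() if x in S)

all_S = [set(c) for r in range(len(x_values) + 1)
         for c in combinations(x_values, r)]
print([g(x) for x in x_values], all(lhs(S) == rhs(S) for S in all_S))
```

Note that $g(X)$ depends on the sample point only through $X$, in line with the first remark above.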


These conditional probabilities and expectations always exist:


Proposition 13.1.7. Let $X$ and $Y$ be jointly defined random variables,
and let $A$ be an event. Then $P(A \mid X)$ and $E(Y \mid X)$ exist (though they are
only unique up to a set of probability 0).


Proof. We may define $P(A \mid X)$ to be $\frac{d\nu}{dP_0}$, where $P_0$ and $\nu$ are measures
on $\sigma(X)$ defined as follows. $P_0$ is simply $P$ restricted to $\sigma(X)$, and

$$\nu(E) = P(A \cap E), \qquad E \in \sigma(X).$$

Note that $\nu \ll P_0$, so by the general Radon-Nikodym Theorem
(Corollary 12.2.2), $\frac{d\nu}{dP_0}$ exists and is unique up to a set of probability 0.

Similarly, we may define $E(Y \mid X)$ to be $\frac{d\rho^+}{dP_0} - \frac{d\rho^-}{dP_0}$, where

$$\rho^+(E) = E(Y^+ 1_E), \qquad \rho^-(E) = E(Y^- 1_E), \qquad E \in \sigma(X).$$

Then (13.1.5) and (13.1.6) are automatically satisfied, since e.g.

$$\begin{aligned}
E\left[P(A \mid X)\, 1_{X \in S}\right] &= E\left[\frac{d\nu}{dP_0} 1_{X \in S}\right] = \int_{X \in S} \frac{d\nu}{dP_0}\,dP \\
&= \int_{X \in S} \frac{d\nu}{dP_0}\,dP_0 = \nu(\{X \in S\}) = P(A \cap \{X \in S\}). \qquad \blacksquare
\end{aligned}$$
€ 


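Example 13.1.1 continued. To better understand Definition 13.1.4, we
return to the case where $(X,Y)$ is assumed to be uniformly distributed over
the triangle $T = \{(x,y) \in \mathbf{R}^2;\ 0 \le y \le 2,\ y \le x \le 2\}$. We can then define
$P(Y > \frac34 \mid X)$ and $E(Y \mid X)$ to be what they "ought" to be, namely

$$P\left(Y > \tfrac34 \,\middle|\, X\right) = \begin{cases} \dfrac{X - \frac34}{X}, & X \ge 3/4 \\ 0, & X < 3/4 \end{cases} \qquad\qquad E(Y \mid X) = \frac{X}{2}.$$

We then compute that, for example,

$$\begin{aligned}
E\left(P\left(Y > \tfrac34 \,\middle|\, X\right) 1_{X \in S}\right) &= \int_{S \cap [\frac34, 2]} \int_0^x \frac{x - \frac34}{x} \cdot \frac12\,\lambda(dy)\,\lambda(dx) \\
&= \int_{S \cap [\frac34, 2]} \frac12 \left(x - \frac34\right) \lambda(dx),
\end{aligned}$$

while

$$\begin{aligned}
P\left(Y > \tfrac34,\ X \in S\right) &= \int_{S \cap [\frac34, 2]} \int_{\frac34}^x \frac12\,\lambda(dy)\,\lambda(dx) \\
&= \int_{S \cap [\frac34, 2]} \frac12 \left(x - \frac34\right) \lambda(dx),
\end{aligned}$$

thus verifying (13.1.5) for the case $A = \{Y > \frac34\}$. Similarly, for $S \subseteq [0,2]$,

$$\begin{aligned}
E\left(E(Y \mid X)\, 1_{X \in S}\right) &= \iint_{T \cap (S \times \mathbf{R})} \frac{x}{2} \cdot \frac12\,\lambda(dy)\,\lambda(dx) \\
&= \int_S \int_0^x \frac{x}{4}\,\lambda(dy)\,\lambda(dx) = \int_S \frac{x^2}{4}\,\lambda(dx),
\end{aligned}$$

while

$$\begin{aligned}
E(Y 1_{X \in S}) &= \iint_{T \cap (S \times \mathbf{R})} y \cdot \frac12\,\lambda(dy)\,\lambda(dx) \\
&= \int_S \int_0^x \frac{y}{2}\,\lambda(dy)\,\lambda(dx) = \int_S \frac{x^2}{4}\,\lambda(dx),
\end{aligned}$$

so that the two expressions in (13.1.6) are also equal.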

This example was quite specific, but similar ideas can be used more gen- 
erally; see Exercises 13.4.3 and 13.4.4. In summary, conditional probabilities 
and expectations always exist, and are unique up to sets of probability 0, 
but finding them may require guessing appropriate random variables and 
then verifying that they satisfy (13.1.5) and (13.1.6). 
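
The "guess and verify" step for Example 13.1.1 can also be checked numerically. The following Python sketch is not part of the original text: it approximates both sides of (13.1.5) and (13.1.6) by midpoint Riemann sums for S = [s_lo, s_hi]. The function names, the grid size n, and the choice of S are illustrative assumptions.

```python
# Midpoint-rule check of (13.1.5) and (13.1.6) for Example 13.1.1.
# (X, Y) is uniform on the triangle {0 < y < x < 2}, so the joint density is 1/2.

def cond_prob(x):
    """Candidate version of P(Y > 3/4 | X = x)."""
    return (x - 0.75) / x if x >= 0.75 else 0.0

def cond_exp(x):
    """Candidate version of E(Y | X = x)."""
    return x / 2.0

def check(s_lo, s_hi, n=400):
    """Return both sides of (13.1.5) and (13.1.6) for S = [s_lo, s_hi] in (0, 2]."""
    dx = (s_hi - s_lo) / n
    lhs_p = rhs_p = lhs_e = rhs_e = 0.0
    for i in range(n):
        x = s_lo + (i + 0.5) * dx
        dy = x / n                    # inner grid over 0 < y < x
        marg = 0.0                    # marginal density of X at x (equals x/2)
        for j in range(n):
            y = (j + 0.5) * dy
            w = 0.5 * dy * dx         # joint density times cell area
            marg += 0.5 * dy
            rhs_p += w if y > 0.75 else 0.0   # P(Y > 3/4, X in S)
            rhs_e += y * w                    # E(Y 1_{X in S})
        lhs_p += cond_prob(x) * marg * dx     # E(P(Y > 3/4 | X) 1_{X in S})
        lhs_e += cond_exp(x) * marg * dx      # E(E(Y | X) 1_{X in S})
    return lhs_p, rhs_p, lhs_e, rhs_e
```

For instance, `check(0.5, 1.5)` returns two pairs of nearly equal numbers. The pair for (13.1.5) agrees only up to the grid resolution, because the indicator of {Y > 3/4} is discontinuous.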


13.2. Conditioning on a sub-σ-algebra. 


So far we have considered conditioning on just a single random variable 
X. We may think of this, equivalently, as conditioning on the generated 
σ-algebra, σ(X). More generally, we may wish to condition on something 
other than just a single random variable X. For example, we may wish to 
condition on a collection of random variables X_1, X_2, ..., X_n, or we may 
wish to condition on certain other events. 

Given any sub-σ-algebra 𝒢 (in place of σ(X) above), we define the 
conditional probability P(A | 𝒢) and the conditional expectation E(Y | 𝒢) to be 
𝒢-measurable random variables satisfying 

$$ E\big(P(A \mid \mathcal{G})\,1_G\big) = P(A \cap G) \tag{13.2.1} $$

and 

$$ E\big(E(Y \mid \mathcal{G})\,1_G\big) = E(Y\,1_G) \tag{13.2.2} $$

for any G ∈ 𝒢. If 𝒢 = σ(X), then this definition reduces precisely to the 
previous one, with P(A | X) being shorthand for P(A | σ(X)), and E(Y | X) 
being shorthand for E(Y | σ(X)). Similarly, we shall write e.g. E(Y | X_1, X_2) 
as shorthand for E(Y | σ(X_1, X_2)), where σ(X_1, X_2) = σ({{X_1 ≤ a} ∩ {X_2 ≤ b} : 
a, b ∈ R}) is the σ-algebra generated by X_1 and X_2. As before, these 
conditional random variables exist (and are unique up to a set of 
probability 0) by the Radon-Nikodym Theorem. 

Two further examples may help to clarify matters. If 𝒢 = {∅, Ω} (i.e., 
𝒢 is the trivial σ-algebra), then P(A | 𝒢) = P(A) and E(Y | 𝒢) = E(Y) 
are constants. On the other hand, if 𝒢 = ℱ (i.e., 𝒢 is the full σ-algebra), 
or more generally if A ∈ 𝒢 and X is 𝒢-measurable, then P(A | 𝒢) = 1_A and 
E(X | 𝒢) = X. Intuitively, if 𝒢 is small then the conditional values cannot 
depend too much on the sample point ω, but rather they must represent 
average values. On the other hand, if 𝒢 is large then the conditional values 
can (and must) be very close approximations to the unconditional values. 
In brief, the larger 𝒢 is, the more random are conditionals with respect to 
𝒢. See also Exercise 13.4.1. 
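
On a finite sample space, a sub-σ-algebra is generated by a partition, and E(Y | 𝒢) is the block-by-block average of Y. The following Python sketch is not from the original text; it illustrates the two extreme cases above. The function name `cond_exp`, the four-point space, and the choice Y(ω) = ω² are assumptions made for the example, and every block is assumed to have positive probability.

```python
from fractions import Fraction

def cond_exp(prob, y, partition):
    """E(Y | G) on a finite space, where G is generated by `partition`.

    `prob` and `y` map each sample point w to P({w}) and Y(w); the result
    maps w to the P-weighted average of Y over the block containing w.
    """
    out = {}
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(y[w] * prob[w] for w in block) / mass
        for w in block:
            out[w] = avg
    return out

omega = [1, 2, 3, 4]
prob = {w: Fraction(1, 4) for w in omega}   # uniform measure (assumed)
y = {w: Fraction(w * w) for w in omega}     # Y(w) = w^2 (assumed)

trivial = [omega]                  # G = {empty set, Omega}
full = [[w] for w in omega]        # G = all subsets of Omega
middle = [[1, 2], [3, 4]]          # an intermediate sub-sigma-algebra
```

With the trivial partition the result is the constant E(Y) = 15/2, with the full partition it is Y itself, and `middle` gives the block averages 5/2 and 25/2.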


Exercise 13.2.3. Let 𝒢_1 and 𝒢_2 be two sub-σ-algebras. 

(a) Prove that if Z is 𝒢_1-measurable, and 𝒢_1 ⊆ 𝒢_2, then Z is also 
𝒢_2-measurable. 

(b) Prove that if Z is 𝒢_1-measurable, and also Z is 𝒢_2-measurable, then Z 
is (𝒢_1 ∩ 𝒢_2)-measurable. 


Exercise 13.2.4. Let S_1 and S_2 be two disjoint events, and let 𝒢 be a 
sub-σ-algebra. Show that the following hold with probability 1. [Hint: Use 
proof by contradiction.] 

(a) 0 ≤ P(S_1 | 𝒢) ≤ 1. 

(b) P(S_1 ∪ S_2 | 𝒢) = P(S_1 | 𝒢) + P(S_2 | 𝒢). 


Remark 13.2.5. Exercise 13.2.4 shows that conditional probabilities 
behave “essentially” like ordinary probabilities. Now, the additivity 
property (b) is only guaranteed to hold with probability 1, and the exceptional 
set of probability 0 could perhaps be different for each S_1 and S_2. This 
suggests that perhaps P(S | 𝒢) cannot be defined in a consistent, countably 
additive way for all S ∈ ℱ. However, it is a fact that if a random variable 
Y is real-valued (or, more generally, takes values in a Polish space, i.e. a 
complete separable metric space like R), then there always exist regular 
conditional distributions, which are versions of P(Y ∈ B | 𝒢) defined 
precisely (not just w.p. 1) for all Borel B in a countably additive way; see e.g. 
Theorem 10.2.2 of Dudley (1989), or Theorem 33.3 of Billingsley (1995). 


We close with two final results about conditional expectations. 


Proposition 13.2.6. Let X and Y be random variables, and let 𝒢 be a 
sub-σ-algebra. Suppose that E(Y) and E(XY) are finite, and furthermore 
that X is 𝒢-measurable. Then with probability 1, 

$$ E(XY \mid \mathcal{G}) = X\,E(Y \mid \mathcal{G}). $$

That is, we can “factor” X out of the conditional expectation. 


Proof. Clearly X E(Y | 𝒢) is 𝒢-measurable. Furthermore, if X = 1_{G_0}, 
with G_0, G ∈ 𝒢, then using the definition of E(Y | 𝒢), we have 

$$ E\big(X\,E(Y \mid \mathcal{G})\,1_G\big) = E\big(E(Y \mid \mathcal{G})\,1_{G \cap G_0}\big) = E\big(Y\,1_{G \cap G_0}\big) = E(XY\,1_G), $$

so that (13.2.2) holds in this case. But then by the usual linearity and 
monotone convergence arguments, (13.2.2) holds for general X. It then 
follows from the definition that X E(Y | 𝒢) is indeed a version of E(XY | 𝒢). ∎ 


For our final result, suppose that 𝒢_1 ⊆ 𝒢_2. Note that since E(Y | 𝒢_1) 
is 𝒢_1-measurable, it is also 𝒢_2-measurable, so from the above discussion we 
have E(E(Y | 𝒢_1) | 𝒢_2) = E(Y | 𝒢_1). That is, conditioning first on 𝒢_1 and 
then on 𝒢_2 is equivalent to conditioning just on the smaller sub-σ-algebra, 
𝒢_1. What is perhaps surprising is that we obtain the same result if we 
condition in the opposite order: 


Proposition 13.2.7. Let Y be a random variable with finite mean, and 
let 𝒢_1 ⊆ 𝒢_2 be two sub-σ-algebras. Then with probability 1, 

$$ E\big(E(Y \mid \mathcal{G}_2) \mid \mathcal{G}_1\big) = E(Y \mid \mathcal{G}_1). $$

That is, conditioning first on 𝒢_2 and then on 𝒢_1 is equivalent to conditioning 
just on the smaller sub-σ-algebra 𝒢_1. 


Proof. By definition, E(E(Y | 𝒢_2) | 𝒢_1) is 𝒢_1-measurable. Hence, to show 
that it is a version of E(Y | 𝒢_1), it suffices to check that, for any G ∈ 𝒢_1 ⊆ 𝒢_2, 
E[E(E(Y | 𝒢_2) | 𝒢_1) 1_G] = E(Y 1_G). But using the definitions of E(··· | 𝒢_1) 
and E(··· | 𝒢_2), respectively, and recalling that G ∈ 𝒢_1 and G ∈ 𝒢_2, we have 
that 

$$ E\big[E\big(E(Y \mid \mathcal{G}_2) \mid \mathcal{G}_1\big)\,1_G\big] = E\big(E(Y \mid \mathcal{G}_2)\,1_G\big) = E(Y\,1_G). $$
∎ 


In the special case 𝒢_1 = 𝒢_2 = 𝒢, we obtain that E[E(Y | 𝒢) | 𝒢] = 
E[Y | 𝒢]. In words, repeating the operation “take conditional expectation 
with respect to 𝒢” multiple times is equivalent to doing it just once. That 
is, conditional expectation is a projection operator, and can be thought of 
as “projecting” the random variable Y onto the σ-algebra 𝒢. 
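
The two orders of iterated conditioning, and the projection property, can be checked exactly on a small finite space. The sketch below is not from the original text. It assumes a hypothetical six-point sample space with unequal weights and two nested partitions, `coarse` (generating 𝒢_1) and `fine` (generating 𝒢_2); all names are illustrative.

```python
from fractions import Fraction

def cond_exp(prob, y, partition):
    """E(Y | G) as a dict, for G generated by a finite partition."""
    out = {}
    for block in partition:
        mass = sum(prob[w] for w in block)
        avg = sum(y[w] * prob[w] for w in block) / mass
        out.update({w: avg for w in block})
    return out

# Hypothetical six-point space with unequal weights (they sum to 1).
prob = {1: Fraction(1, 12), 2: Fraction(1, 6), 3: Fraction(1, 4),
        4: Fraction(1, 6), 5: Fraction(1, 4), 6: Fraction(1, 12)}
y = {w: Fraction(w ** 2) for w in prob}

coarse = [[1, 2, 3, 4], [5, 6]]        # generates G_1
fine = [[1, 2], [3, 4], [5], [6]]      # generates G_2, which contains G_1

direct = cond_exp(prob, y, coarse)                              # E(Y | G_1)
fine_then_coarse = cond_exp(prob, cond_exp(prob, y, fine), coarse)
coarse_then_fine = cond_exp(prob, cond_exp(prob, y, coarse), fine)
twice = cond_exp(prob, direct, coarse)                          # projection applied twice
```

All four dictionaries coincide, as Proposition 13.2.7 and the preceding discussion predict; exact rational arithmetic avoids any rounding questions.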


13.3. Conditional variance. 


Given jointly defined random variables X and Y, we can define the 
conditional variance of Y given X by 


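$$ \mathrm{Var}(Y \mid X) = E\big[(Y - E(Y \mid X))^2 \mid X\big]. $$


Intuitively, Var(Y | X) is a measure of how much uncertainty there is in Y 
even after we know X. Since X may well provide some information about 
Y, we might expect that on average Var(Y | X) ≤ Var(Y). That is indeed 
the case. More precisely: 


Theorem 13.3.1. Let Y be a random variable, and 𝒢 a sub-σ-algebra. 
If Var(Y) < ∞, then 

$$ \mathrm{Var}(Y) = E[\mathrm{Var}(Y \mid \mathcal{G})] + \mathrm{Var}[E(Y \mid \mathcal{G})]. $$


Proof. We compute (writing m = E(Y) = E[E(Y | 𝒢)], and using (13.1.3) 
and Exercise 13.2.4) that 

$$ \begin{aligned}
\mathrm{Var}(Y) &= E[(Y - m)^2] \\
&= E\big[E[(Y - m)^2 \mid \mathcal{G}]\big] \\
&= E\big[E[(Y - E(Y \mid \mathcal{G}) + E(Y \mid \mathcal{G}) - m)^2 \mid \mathcal{G}]\big] \\
&= E\big[E[(Y - E(Y \mid \mathcal{G}))^2 \mid \mathcal{G}]\big] + E\big[E[(E(Y \mid \mathcal{G}) - m)^2 \mid \mathcal{G}]\big] \\
&\qquad + 2\,E\big[E[(Y - E(Y \mid \mathcal{G}))(E(Y \mid \mathcal{G}) - m) \mid \mathcal{G}]\big] \\
&= E[\mathrm{Var}(Y \mid \mathcal{G})] + \mathrm{Var}[E(Y \mid \mathcal{G})] + 0,
\end{aligned} $$

since using Proposition 13.2.6, we have 

$$ \begin{aligned}
E\big[(Y - E(Y \mid \mathcal{G}))(E(Y \mid \mathcal{G}) - m) \mid \mathcal{G}\big] &= (E(Y \mid \mathcal{G}) - m)\,E\big[(Y - E(Y \mid \mathcal{G})) \mid \mathcal{G}\big] \\
&= (E(Y \mid \mathcal{G}) - m)\,\big[E(Y \mid \mathcal{G}) - E(Y \mid \mathcal{G})\big] = 0.
\end{aligned} $$
∎ 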


For example, if G = o(X), then E[Var(Y | X)] represents the average 
uncertainty in Y once X is known, while Var[E(Y | X)] represents the un- 
certainty in Y caused by uncertainty in X. Theorem 13.3.1 asserts that the 
total variance of Y is given by the sum of these two contributions. 
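
As a concrete check of Theorem 13.3.1 with 𝒢 = σ(X), the following Python sketch (not from the original text) computes the three terms exactly for a small discrete pair (X, Y). The joint mass function `q` and the function name are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical joint mass function q[(x, y)] = P(X = x, Y = y).
q = {(0, 1): Fraction(1, 8), (0, 3): Fraction(3, 8),
     (1, 0): Fraction(1, 4), (1, 4): Fraction(1, 4)}

def total_variance_terms(q):
    """Return (Var(Y), E[Var(Y | X)], Var[E(Y | X)]) for a finite joint pmf."""
    px = {}
    for (x, _), p in q.items():
        px[x] = px.get(x, 0) + p                      # marginal P(X = x)
    mean_y = sum(y * p for (_, y), p in q.items())
    var_y = sum((y - mean_y) ** 2 * p for (_, y), p in q.items())
    # Conditional mean and variance of Y on each event {X = x}.
    m = {x: sum(y * p for (x2, y), p in q.items() if x2 == x) / px[x]
         for x in px}
    v = {x: sum((y - m[x]) ** 2 * p for (x2, y), p in q.items() if x2 == x) / px[x]
         for x in px}
    within = sum(v[x] * px[x] for x in px)                # E[Var(Y | X)]
    between = sum((m[x] - mean_y) ** 2 * px[x] for x in px)  # Var[E(Y | X)]
    return var_y, within, between

var_y, within, between = total_variance_terms(q)
```

Here Var(Y) = 39/16 splits as 19/8 (average uncertainty left in Y once X is known) plus 1/16 (uncertainty in Y caused by uncertainty in X).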


13.4. Exercises. 


Exercise 13.4.1. Let A and B be events, with 0 < P(B) < 1. Let 
𝒢 = σ(B) be the σ-algebra generated by B. 

(a) Describe 𝒢 explicitly. 

(b) Compute P(A | 𝒢) explicitly. 

(c) Relate P(A | 𝒢) to the earlier notion of P(A | B) = P(A ∩ B) / P(B). 

(d) Similarly, for a random variable Y with finite mean, compute E(Y | 𝒢), 
and relate it to the earlier notion of E(Y | B) = E(Y 1_B) / P(B). 


Exercise 13.4.2. Let 𝒢 be a sub-σ-algebra, and let A be any event. 
Define the random variable X to be the indicator function 1_A. Prove that 
E(X | 𝒢) = P(A | 𝒢) with probability 1. 


Exercise 13.4.3. Suppose X and Y are discrete random variables. Let 
q(x, y) = P(X = x, Y = y). 

(a) Show that with probability 1, 

$$ E(Y \mid X) = \frac{\sum_y y\,q(X, y)}{\sum_z q(X, z)}. $$

[Hint: One approach is to first argue that it suffices in (13.1.6) to consider 
the case S = {x_0}.] 

(b) Compute P(Y = y | X). [Hint: Use Exercise 13.4.2 and part (a).] 

(c) Show that E(Y | X) = Σ_y y P(Y = y | X). 


Exercise 13.4.4. Let X and Y be random variables with joint distribution 
given by ℒ(X,Y) = dP = f(x,y) λ_2(dx,dy), where λ_2 is two-dimensional 
Lebesgue measure, and f : R² → R is a non-negative Borel-measurable 
function with ∫_{R²} f dλ_2 = 1. (Example 13.1.1 corresponds to 
the case f(x,y) = (1/2) 1_T(x,y).) Show that we can take P(Y ∈ B | X) = 
∫_B g_X(y) λ(dy) and E(Y | X) = ∫_R y g_X(y) λ(dy), where the function 
g_x : R → R is defined by g_x(y) = f(x,y) / ∫_R f(x,t) λ(dt) whenever 
∫_R f(x,t) λ(dt) is positive and finite, otherwise (say) g_x(y) = 0. 


Exercise 13.4.5. Let Ω = {1,2,3}, and define random variables X and 
Y by Y(ω) = ω, and X(1) = X(2) = 5 and X(3) = 6. Let Z = E(Y | X). 

(a) Describe σ(X) precisely. 

(b) Describe (with proof) Z(ω) for each ω ∈ Ω. 


Exercise 13.4.6. Let 𝒢 be a sub-σ-algebra, and let X and Y be two 
independent random variables. Prove by example that E(X | 𝒢) and E(Y | 𝒢) 
need not be independent. [Hint: Don’t forget Exercise 3.6.3(a).] 


Exercise 13.4.7. Suppose Y is σ(X)-measurable, and also X and Y are 
independent. Prove that there is C ∈ R with P(Y = C) = 1. [Hint: First 
prove that P(Y ≤ y) = 0 or 1 for each y ∈ R.] 


Exercise 13.4.8. Suppose Y is 𝒢-measurable. Prove that Var(Y | 𝒢) = 0. 


Exercise 13.4.9. Suppose X and Y are independent. 

(a) Prove that E(Y | X) = E(Y) w.p. 1. 

(b) Prove that Var(Y | X) = Var(Y) w.p. 1. 

(c) Explicitly verify Theorem 13.3.1 (with 𝒢 = σ(X)) in this case. 


Exercise 13.4.10. Give an example of jointly defined random variables 
which are not independent, but such that E(Y | X) = E(Y) w.p. 1. 


Exercise 13.4.11. Let X and Y be jointly defined random variables. 
(a) Suppose E(Y | X) = E(Y) w.p. 1. Prove that E(XY) = E(X) E(Y). 
(b) Give an example where E(XY) = E(X) E(Y), but it is not the case 
that E(Y | X) = E(Y) w.p. 1. 


Exercise 13.4.12. Let {Z_n} be independent, each with finite mean. Let 
X_0 = a, and X_n = a + Z_1 + ... + Z_n for n ≥ 1. Prove that 

$$ E(X_{n+1} \mid X_0, X_1, \ldots, X_n) = X_n + E(Z_{n+1}). $$


Exercise 13.4.13. Let X_0, X_1, ... be a Markov chain on a countable 
state space S, with transition probabilities {p_ij}, and with E|X_n| < ∞ for 
all n. Prove that with probability 1: 

(a) E(X_{n+1} | X_0, X_1, ..., X_n) = Σ_{j∈S} j p_{X_n, j}. [Hint: Don’t forget 
Exercise 13.4.3.] 

(b) E(X_{n+1} | X_n) = Σ_{j∈S} j p_{X_n, j}. 


13.5. Section summary. 


This section discussed conditioning, with an emphasis on the problems 
that arise (and their partial resolution) when conditioning on events of 
probability 0, and more generally on random variables and on sub-σ-algebras. It 
presented definitions, examples, and various properties of conditional 
probability and conditional expectation in these cases. 


14. Martingales. 


In this section we study a special kind of stochastic process called a 
martingale. A stochastic process X_0, X_1, ... is a martingale if E|X_n| < ∞ 
for all n, and with probability 1, 

$$ E(X_{n+1} \mid X_0, X_1, \ldots, X_n) = X_n. $$

Of course, this really means that E(X_{n+1} | ℱ_n) = X_n, where ℱ_n is the 
σ-algebra σ(X_0, X_1, ..., X_n). Intuitively, it says that on average, the value of 
X_{n+1} is the same as that of X_n. 


Remark 14.0.1. More generally, we can define a martingale by specifying 
that E(X_{n+1} | ℱ_n) = X_n for some choice of increasing sub-σ-fields ℱ_n such 
that X_n is measurable with respect to ℱ_n. But then σ(X_0, ..., X_n) ⊆ ℱ_n, 
so by Proposition 13.2.7, 

$$ \begin{aligned}
E(X_{n+1} \mid X_0, \ldots, X_n) &= E\big[E(X_{n+1} \mid \mathcal{F}_n) \mid X_0, \ldots, X_n\big] \\
&= E[X_n \mid X_0, \ldots, X_n] = X_n,
\end{aligned} $$

so the above definition is also satisfied, i.e. the two definitions are equivalent. 


Markov chains often provide good examples of martingales (though there 
are non-Markovian martingales too, cf. Exercise 14.4.1). Indeed, by 
Exercise 13.4.13, a Markov chain on a countable state space S, with transition 
probabilities p_ij, and with E|X_n| < ∞, will be a martingale provided that 

$$ \sum_{j \in S} j\,p_{ij} = i, \qquad i \in S. \tag{14.0.2} $$

(Intuitively, given that the chain is at state i at time n, on average it will 
still equal i at time n+1.) An important specific case is simple symmetric 
random walk, where S = Z and X_0 = 0 and p_{i,i−1} = p_{i,i+1} = 1/2 for all 
i ∈ Z. 

We shall also have occasion to consider submartingales and supermartingales. 
The sequence X_0, X_1, ... is a submartingale if E|X_n| < ∞ for all n, 
and also 

$$ E(X_{n+1} \mid X_0, X_1, \ldots, X_n) \ge X_n. \tag{14.0.3} $$

It is a supermartingale if E|X_n| < ∞ for all n, and also 

$$ E(X_{n+1} \mid X_0, X_1, \ldots, X_n) \le X_n. $$

(These names are very standard, even though they are arguably the reverse 
of what they should be.) Thus, a process is a martingale if and only if it is 
both a submartingale and a supermartingale. And, {X_n} is a submartingale 
if and only if {−X_n} is a supermartingale. Furthermore, again by 
Exercise 13.4.13, a Markov chain {X_n} with E|X_n| < ∞ is a submartingale 
if Σ_{j∈S} j p_ij ≥ i for all i ∈ S, or is a supermartingale if Σ_{j∈S} j p_ij ≤ i 
for all i ∈ S. Similarly, if X_0 = 0 and X_n = Z_1 + ... + Z_n, where {Z_i} 
are i.i.d. with finite mean, then by Exercise 13.4.12, {X_n} is a martingale 
if E(Z_i) = 0, is a supermartingale if E(Z_i) ≤ 0, or is a submartingale if 
E(Z_i) ≥ 0. 

If {X_n} is a submartingale, then taking expectations of both sides 
of (14.0.3) gives that E(X_{n+1}) ≥ E(X_n), so by induction 

$$ E(X_n) \ge E(X_{n-1}) \ge \ldots \ge E(X_1) \ge E(X_0). \tag{14.0.4} $$

Similarly, using Proposition 13.2.7, 

$$ \begin{aligned}
E[X_{n+2} \mid X_0, \ldots, X_n] &= E\big[E(X_{n+2} \mid X_0, \ldots, X_{n+1}) \mid X_0, \ldots, X_n\big] \\
&\ge E[X_{n+1} \mid X_0, \ldots, X_n] \ge X_n,
\end{aligned} $$

so by induction 

$$ E(X_m \mid X_0, X_1, \ldots, X_n) \ge X_n, \qquad m > n. \tag{14.0.5} $$

For supermartingales, analogous statements to (14.0.4) and (14.0.5) follow 
with each ≥ replaced by ≤, and for martingales they can be replaced by =. 
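
The Markov chain criteria above are easy to test mechanically on a finite window of states. The Python sketch below is not from the original text. It compares Σ_j j p_ij with i row by row; the helper names are illustrative, the integrability condition E|X_n| < ∞ is simply assumed, and only the listed states are examined.

```python
from fractions import Fraction

def mean_next(row):
    """Sum_j j * p_ij for one row {j: p_ij} of a transition matrix."""
    return sum(j * p for j, p in row.items())

def classify(rows):
    """Apply (14.0.2) and its inequality versions to rows = {i: {j: p_ij}}."""
    diffs = [mean_next(row) - i for i, row in rows.items()]
    if all(d == 0 for d in diffs):
        return "martingale"
    if all(d >= 0 for d in diffs):
        return "submartingale"
    if all(d <= 0 for d in diffs):
        return "supermartingale"
    return "none"

def walk_rows(p_up, states):
    """Rows of a nearest-neighbour walk: +1 w.p. p_up, -1 otherwise."""
    return {i: {i + 1: p_up, i - 1: 1 - p_up} for i in states}
```

For example, `classify(walk_rows(Fraction(1, 2), range(-5, 6)))` reports a martingale, while an upward bias p_up > 1/2 gives a submartingale and a downward bias gives a supermartingale.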


14.1. Stopping times. 


If X_0, X_1, ... is a martingale, then as in (14.0.4), E(X_n) = E(X_0) for all 
n ∈ N. But what about E(X_τ), where τ is a random time? That is, if we 
define a new random variable Y by Y(ω) = X_{τ(ω)}(ω), then must we also 
have E(Y) = E(X_0)? If τ is independent of {X_n}, then E(X_τ) is simply a 
weighted average of different E(X_n), and therefore still equals E(X_0). But 
what if τ and {X_n} are not independent? 

We shall assume that τ is a stopping time, i.e. a non-negative-integer-valued 
random variable with the property that {τ = n} ∈ σ(X_0, ..., X_n). 
(Intuitively, this means that one can determine if τ = n just by knowing 
the values X_0, ..., X_n; τ does not “look into the future” to decide whether 
or not to stop at time n.) Under these conditions, must it be true that 
E(X_τ) = E(X_0)? 

The answer to this question is no in general. Indeed, consider simple 
symmetric random walk, with X_0 = 0, and let τ = inf{n ≥ 0; X_n = −5}. 
Then τ is a stopping time, since {τ = n} = {X_0 ≠ −5, ..., X_{n−1} ≠ −5, 
X_n = −5} ∈ σ(X_0, ..., X_n). Furthermore, since {X_n} is recurrent, we 
have P(τ < ∞) = 1. On the other hand, clearly X_τ = −5 with probability 
1, so that E(X_τ) = −5, not 0. 

However, for bounded stopping times the situation is quite different, as 
the following result (one version of the Optional Sampling Theorem) shows. 


Theorem 14.1.1. Let {X_n} be a submartingale, let M ∈ N, and let 
τ_1, τ_2 be stopping times such that 0 ≤ τ_1 ≤ τ_2 ≤ M with probability 1. 
Then E(X_{τ_2}) ≥ E(X_{τ_1}). 


Proof. Note first that {τ_1 < k ≤ τ_2} = {τ_1 ≤ k−1} ∩ {τ_2 ≤ k−1}^C, 
so that (since τ_1 and τ_2 are stopping times) the event {τ_1 < k ≤ τ_2} is in 
σ(X_0, ..., X_{k−1}). We therefore have, using a telescoping series, linearity of 
expectation, and the definition of E(··· | X_0, ..., X_{k−1}), that 

$$ \begin{aligned}
E(X_{\tau_2}) - E(X_{\tau_1}) &= E(X_{\tau_2} - X_{\tau_1}) \\
&= E\Big(\sum_{k=\tau_1+1}^{\tau_2} (X_k - X_{k-1})\Big) \\
&= E\Big(\sum_{k=1}^{M} (X_k - X_{k-1})\,1_{\tau_1 < k \le \tau_2}\Big) \\
&= \sum_{k=1}^{M} E\big((X_k - X_{k-1})\,1_{\tau_1 < k \le \tau_2}\big) \\
&= \sum_{k=1}^{M} E\big((E(X_k \mid X_0, \ldots, X_{k-1}) - X_{k-1})\,1_{\tau_1 < k \le \tau_2}\big).
\end{aligned} $$

But since {X_n} is a submartingale, (E(X_k | X_0, ..., X_{k−1}) − X_{k−1}) ≥ 0. 
Therefore, E(X_{τ_2}) − E(X_{τ_1}) ≥ 0, as claimed. ∎ 


Corollary 14.1.2. Let {X_n} be a martingale, let M ∈ N, and let τ_1, τ_2 
be stopping times such that 0 ≤ τ_1 ≤ τ_2 ≤ M with probability 1. Then 
E(X_{τ_2}) = E(X_{τ_1}). 


Proof. Since both {X_n} and {−X_n} are submartingales, Theorem 14.1.1 
gives that E(X_{τ_2}) ≥ E(X_{τ_1}) and also E(−X_{τ_2}) ≥ E(−X_{τ_1}). The result 
follows. ∎ 


Setting τ_1 = 0, we obtain: 


Corollary 14.1.3. If {X_n} is a martingale, and τ is a bounded stopping 
time, then E(X_τ) = E(X_0). 


Example 14.1.4. Let {X_n} be simple symmetric random walk with 
X_0 = 0, and let τ = min(10^12, inf{n ≥ 0; X_n = −5}). Then τ is indeed a 
bounded stopping time (with τ ≤ 10^12), so we must have E(X_τ) = 0. This 
is surprising since the probability is extremely high that τ < 10^12, and in 
this case X_τ = −5. However, in the rare case that τ = 10^12, X_τ will (on 
average) be so large as to cancel this −5 out. More precisely, using the 
observation (13.0.1) that E(Z | B) = E(Z 1_B) / P(B) when P(B) > 0, we 
have that 

$$ \begin{aligned}
0 = E(X_\tau) &= E(X_\tau 1_{\tau < 10^{12}}) + E(X_\tau 1_{\tau = 10^{12}}) \\
&= E(X_\tau \mid \tau < 10^{12})\,P(\tau < 10^{12}) + E(X_\tau \mid \tau = 10^{12})\,P(\tau = 10^{12}) \\
&= (-5)\,P(\tau < 10^{12}) + E(X_{10^{12}} \mid \tau = 10^{12})\,P(\tau = 10^{12}).
\end{aligned} $$

Here P(τ < 10^12) ≈ 1 and P(τ = 10^12) ≈ 0, but E(X_{10^12} | τ = 10^12) is so 
huge that the equation still holds. 
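
The same cancellation can be seen exactly with a much smaller horizon in place of 10^12. The following Python sketch is not from the original text. It propagates the law of the walk with exact rational arithmetic and splits on {τ < horizon} versus {τ = horizon}; the function name and the horizon of 200 are illustrative choices.

```python
from fractions import Fraction

def stopped_walk(horizon, barrier=-5):
    """Simple symmetric random walk from 0, stopped at
    tau = min(horizon, first time the walk equals `barrier`).

    Returns (P(tau < horizon), P(tau = horizon), E(X_tau | tau = horizon)).
    """
    alive = {0: Fraction(1)}   # law of X_n restricted to {barrier not hit before n}
    early = Fraction(0)        # P(barrier hit strictly before `horizon`)
    for n in range(1, horizon + 1):
        nxt = {}
        for x, p in alive.items():
            for y in (x - 1, x + 1):
                if y == barrier and n < horizon:
                    early += p / 2          # stopped at time n < horizon
                else:
                    nxt[y] = nxt.get(y, Fraction(0)) + p / 2
        alive = nxt
    late = sum(alive.values())              # P(tau = horizon)
    mean_late = sum(x * p for x, p in alive.items()) / late
    return early, late, mean_late

early, late, mean_late = stopped_walk(200)
expected_stop_value = -5 * early + mean_late * late
```

With this horizon most of the probability already sits on {τ < 200}, yet `expected_stop_value` is exactly 0: the conditional mean on the remaining event is positive and just large enough to cancel the −5 contribution.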


A generalisation of Theorem 14.1.1, allowing for unbounded stopping 
times, is as follows. 


Theorem 14.1.5. Let {X_n} be a martingale with stopping time τ. 
Suppose P(τ < ∞) = 1, and E|X_τ| < ∞, and lim_{n→∞} E[X_n 1_{τ>n}] = 0. 
Then E(X_τ) = E(X_0). 


Proof. Let Z_n = X_{min(τ,n)} for n = 0, 1, 2, .... Then Z_n = X_τ 1_{τ≤n} + 
X_n 1_{τ>n} = X_τ − X_τ 1_{τ>n} + X_n 1_{τ>n}, so X_τ = Z_n − X_n 1_{τ>n} + X_τ 1_{τ>n}. 
Hence, 

$$ E(X_\tau) = E(Z_n) - E[X_n 1_{\tau > n}] + E[X_\tau 1_{\tau > n}]. $$

Since min(τ, n) is a bounded stopping time (cf. Exercise 14.4.6(b)), it follows 
from Corollary 14.1.3 that E(Z_n) = E(X_0) for all n. As n → ∞, the second 
term goes to 0 by assumption. Also, the third term goes to 0 by the 
Dominated Convergence Theorem, since E|X_τ| < ∞, and 1_{τ>n} → 0 w.p. 1 since 
P[τ < ∞] = 1. Hence, letting n → ∞, we obtain that E(X_τ) = E(X_0). ∎ 


Remark 14.1.6. Theorem 14.1.5 in turn implies Corollary 14.1.3, since 
if τ ≤ M, then X_n 1_{τ>n} = 0 for all n > M, and also 

$$ E|X_\tau| = E|X_0 1_{\tau=0} + \ldots + X_M 1_{\tau=M}| \le E|X_0| + \ldots + E|X_M| < \infty. $$

However, this reasoning is circular, since we used Corollary 14.1.3 in the 
proof of Theorem 14.1.5. 


Corollary 14.1.7. Let {X_n} be a martingale with stopping time τ, 
such that P[τ < ∞] = 1. Assume that {X_n} is bounded up to time τ, 
i.e. that there is M < ∞ such that |X_n| 1_{n≤τ} ≤ M 1_{n≤τ} for all n. Then 
E(X_τ) = E(X_0). 


Proof. We have |X_τ| ≤ M, so that E|X_τ| ≤ M < ∞. Also |E(X_n 1_{τ>n})| ≤ 
E(|X_n| 1_{τ>n}) ≤ E(M 1_{τ>n}) = M P(τ > n), which converges to 0 as n → ∞ 
since P[τ < ∞] = 1. Hence, the result follows from Theorem 14.1.5. ∎ 


Remark 14.1.8. For submartingales {X_n} we may replace = by ≥ in 
Corollary 14.1.3 and Theorem 14.1.5 and Corollary 14.1.7, while for 
supermartingales we may replace = by ≤. 


Another approach is given by: 


Theorem 14.1.9. Let τ be a stopping time for a martingale {X_n}, with 
P(τ < ∞) = 1. Then E(X_τ) = E(X_0) if and only if lim_{n→∞} E[X_{min(τ,n)}] = 
E[lim_{n→∞} X_{min(τ,n)}]. 


Proof. By Corollary 14.1.3, lim_{n→∞} E[X_{min(τ,n)}] = lim_{n→∞} E(X_0) = 
E(X_0). Also, since P(τ < ∞) = 1, we must have P[lim_{n→∞} X_{min(τ,n)} = 
X_τ] = 1, so E[lim_{n→∞} X_{min(τ,n)}] = E(X_τ). The result follows. ∎ 


Combining Theorem 14.1.9 with the Dominated Convergence Theorem 
yields: 


Corollary 14.1.10. Let τ be a stopping time for a martingale {X_n}, 
with P(τ < ∞) = 1. Suppose there is a random variable Y with E(Y) < ∞ 
and |X_{min(τ,n)}| ≤ Y for all n. Then E(X_τ) = E(X_0). 


As a particular case, we have: 


Corollary 14.1.11. Let τ be a stopping time for a martingale {X_n}. 
Suppose E(τ) < ∞, and |X_n − X_{n−1}| ≤ M < ∞ for all n. Then E(X_τ) = 
E(X_0). 


Proof. Let Y = |X_0| + Mτ. Then E(Y) ≤ E|X_0| + M E(τ) < ∞. Also, 

$$ |X_{\min(\tau,n)}| \le |X_0| + M \min(\tau, n) \le Y. $$

Hence, the result follows from Corollary 14.1.10. ∎ 


Similarly, combining Theorem 14.1.9 with the Uniform Integrability 
Convergence Theorem yields: 


Corollary 14.1.12. Let τ be a stopping time for a martingale {X_n}, 
with P(τ < ∞) = 1. Suppose 

$$ \lim_{\alpha \to \infty} \sup_n E\big(|X_{\min(\tau,n)}|\,1_{|X_{\min(\tau,n)}| \ge \alpha}\big) = 0. $$

Then E(X_τ) = E(X_0). 


Example 14.1.13. (Sequence waiting time.) Let {r_n}_{n≥1} be infinite 
fair coin tossing (cf. Subsection 2.6), and let τ = inf{n ≥ 3 : r_{n−2} = 1, 
r_{n−1} = 0, r_n = 1} be the first time the sequence heads-tails-heads is 
completed. Surprisingly, E(τ) can be computed using martingales. Indeed, 
suppose that at each time n, a new player appears and bets $1 on heads, 
then if they win they bet $2 on tails, then if they win again they bet 
$4 on heads. (They stop betting as soon as they either lose once or win 
three bets in a row.) Let S_n be the total amount won by all the betters by 
time n. Then by construction, {S_n} is a martingale with stopping time τ, 
and furthermore |S_n − S_{n−1}| ≤ 7 < ∞. Also, E(τ) < ∞, and S_τ = −τ + 10 
by Exercise 14.4.12. Hence, by Corollary 14.1.11, 0 = E(S_τ) = −E(τ) + 10, 
whence E(τ) = 10. (See also Exercise 14.4.13.) 
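
The betting scheme can be simulated directly. The Python sketch below is not from the original text. It assumes each player bets on the successive symbols of heads-tails-heads with fair even-money stakes of $1, $2 and $4, tracks the combined winnings S_n, and stops at the first completed heads-tails-heads; the function name and the seed are illustrative.

```python
import random

def play_until_hth(rng):
    """Toss a fair coin until heads-tails-heads appears; return (tau, S_tau)."""
    pattern = (1, 0, 1)      # heads = 1, tails = 0
    progress = []            # for each active player: number of bets won so far
    total = 0                # S_n: combined net winnings of all players
    history = []
    n = 0
    while True:
        n += 1
        toss = rng.randint(0, 1)
        history.append(toss)
        progress.append(0)   # a new player joins at every time n
        still_active = []
        for won in progress:
            stake = 2 ** won             # $1, then $2, then $4
            if toss == pattern[won]:
                total += stake
                if won + 1 < 3:          # after three wins the player stops
                    still_active.append(won + 1)
            else:
                total -= stake           # a losing player stops betting
        progress = still_active
        if tuple(history[-3:]) == pattern:
            return n, total

rng = random.Random(0)
results = [play_until_hth(rng) for _ in range(20000)]
mean_tau = sum(t for t, _ in results) / len(results)
```

On every simulated path S_τ equals 10 − τ, and the sample mean of τ comes out close to 10, in line with the martingale argument.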


Finally, we observe the following fact about E(X_τ) when {X_n} is a general 
random walk, i.e. is given by sums of i.i.d. random variables. (If the 
{Z_i} are bounded, then part (a) below also follows from applying 
Corollary 14.1.11 to {X_n − nm}.) 


Theorem 14.1.14. (Wald’s theorem) Let {Z_i} be i.i.d. with finite 
mean m. Let X_0 = a, and X_n = a + Z_1 + ... + Z_n for n ≥ 1. Let τ 
be a stopping time for {X_n}, with E(τ) < ∞. Then 

(a) E(X_τ) = a + m E(τ); and furthermore 

(b) if m = 0 and E(Z_1²) < ∞, then Var(X_τ) = E(Z_1²) E(τ). 


Proof. Since {i ≤ τ} = {τ ≤ i−1}^C ∈ σ(X_0, ..., X_{i−1}) = σ(Z_1, ..., Z_{i−1}), 
it follows from Lemma 3.5.2 that {i ≤ τ} and {Z_i ∈ B} are independent 
for any Borel B ⊆ R, so that 1_{i≤τ} and Z_i are independent for each i. 

For part (a), assume first that the Z_i are non-negative. It follows from 
countable linearity and independence that 

$$ \begin{aligned}
E(X_\tau - a) &= E(Z_1 + \ldots + Z_\tau) = E\Big(\sum_{i=1}^{\infty} Z_i 1_{i \le \tau}\Big) \\
&= \sum_{i=1}^{\infty} E(Z_i 1_{i \le \tau}) = \sum_{i=1}^{\infty} E(Z_i)\,E(1_{i \le \tau}) \\
&= \sum_{i=1}^{\infty} E(Z_1)\,E(1_{i \le \tau}) = E(Z_1) \sum_{i=1}^{\infty} P(\tau \ge i) = E(Z_1)\,E(\tau),
\end{aligned} $$

the last equality following from Proposition 4.2.9. 

For general Z_i, the above shows that E(Σ_i |Z_i| 1_{i≤τ}) = E|Z_1| E(τ) < ∞. 
Hence, by Exercise 4.5.14(a) or Corollary 9.4.4, we can write 

$$ E(X_\tau - a) = E\Big(\sum_i Z_i 1_{i \le \tau}\Big) = \sum_i E(Z_i 1_{i \le \tau}) = E(Z_1)\,E(\tau). $$

For part (b), if m = 0 then from the above E(X_τ) = a. Assume first 
that τ is bounded, i.e. τ ≤ M < ∞. Then using part (a), 

$$ \begin{aligned}
\mathrm{Var}(X_\tau) = E\big((X_\tau - a)^2\big) &= E\Big(\Big(\sum_{i=1}^{M} Z_i 1_{i \le \tau}\Big)^2\Big) \\
&= E\Big(\sum_{i=1}^{M} Z_i^2 1_{i \le \tau}\Big) + 0 = E(Z_1^2)\,E(\tau),
\end{aligned} $$

where the cross terms equal 0 since as before {j ≤ τ} ∈ σ(Z_1, ..., Z_{j−1}) so 
Z_i 1_{j≤τ} is independent of Z_j for i < j, and hence 

$$ \begin{aligned}
E\Big[\sum_{i<j} Z_i Z_j 1_{i \le \tau} 1_{j \le \tau}\Big] &= E\Big[\sum_{1 \le i < j \le M} Z_i Z_j 1_{j \le \tau}\Big] \\
&= \sum_{1 \le i < j \le M} E(Z_i Z_j 1_{j \le \tau}) = \sum_{1 \le i < j \le M} E(Z_i 1_{j \le \tau})\,E(Z_j) = 0.
\end{aligned} $$

If τ is not bounded, then let ρ_k = min(τ, k) be corresponding bounded 
stopping times. The above gives that Var(X_{ρ_k}) = E(Z_1²) E(ρ_k) for each k. 
As k → ∞, E(ρ_k) → E(τ) by the Monotone Convergence Theorem, so the 
result follows from Lemma 14.1.15 below. ∎ 
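
Both parts of Wald's theorem can be verified exactly for a ±1 random walk with a bounded stopping time. The Python sketch below is not from the original text. It takes steps +1 with probability p_up and −1 otherwise, so m = 2 p_up − 1 and E(Z_1²) = 1, and uses τ = min(horizon, first time the walk hits a barrier). The function name and all parameter values are illustrative assumptions, and the start is assumed to differ from the barrier.

```python
from fractions import Fraction

def wald_terms(p_up, start, barrier, horizon):
    """Return (E(X_tau), E(tau), Var(X_tau)) exactly, where
    tau = min(horizon, first n >= 1 with X_n == barrier)."""
    alive = {start: Fraction(1)}      # law of X_n on the event {tau > n}
    e_x = e_x2 = e_tau = Fraction(0)
    for n in range(1, horizon + 1):
        nxt = {}
        for x, p in alive.items():
            for y, q in ((x + 1, p_up), (x - 1, 1 - p_up)):
                if y == barrier or n == horizon:   # the walk stops here: tau = n
                    e_x += y * p * q
                    e_x2 += y * y * p * q
                    e_tau += n * p * q
                else:
                    nxt[y] = nxt.get(y, Fraction(0)) + p * q
        alive = nxt
    return e_x, e_tau, e_x2 - e_x ** 2

# Part (a): biased steps with mean m = 1/5, start a = 0.
e_x, e_tau, _ = wald_terms(Fraction(3, 5), 0, 3, 30)
# Part (b): fair steps, so m = 0 and E(Z_1^2) = 1.
e_x0, e_tau0, var0 = wald_terms(Fraction(1, 2), 0, 3, 30)
```

In the biased case E(X_τ) equals a + m E(τ) exactly, and in the fair case Var(X_τ) equals E(τ), which is E(Z_1²) E(τ) for ±1 steps.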


The final argument in the proof of Theorem 14.1.14 requires two technical 
lemmas involving L² theory: 


Lemma 14.1.15. In the proof of Theorem 14.1.14, lim_{k→∞} Var(X_{ρ_k}) = 
Var(X_τ). 


Proof (optional). Clearly X_{ρ_k} → X_τ a.s. Let ||W|| = √E(W²) be the 
usual L² norm. Note first that for m < n, using Theorem 14.1.14(a) and 
since the cross terms again vanish, we have 

$$ \begin{aligned}
\|X_{\rho_n} - X_{\rho_m}\|^2 &= E[(X_{\rho_n} - X_{\rho_m})^2] = E[(Z_{\rho_m+1} + \ldots + Z_{\rho_n})^2] \\
&= E[Z_{\rho_m+1}^2 + \ldots + Z_{\rho_n}^2] + 0 = E[Z_1^2 + \ldots + Z_{\rho_n}^2] - E[Z_1^2 + \ldots + Z_{\rho_m}^2] \\
&= E(Z_1^2)\,E(\rho_n) - E(Z_1^2)\,E(\rho_m) = E(Z_1^2)\,[E(\rho_n) - E(\rho_m)],
\end{aligned} $$

which → E(Z_1²)[E(τ) − E(τ)] = 0 as n, m → ∞. This means that the 
sequence {X_{ρ_n}}_{n=1}^∞ is a Cauchy sequence in L², and since L² is complete (see 
e.g. Dudley (1989), Theorem 5.2.1), we must have lim_{n→∞} ||X_{ρ_n} − Y|| = 0 
for some random variable Y. It then follows from Lemma 14.1.16 below 
(with p = 2) that Y = X_τ a.s., so that lim_{n→∞} ||X_{ρ_n} − X_τ|| = 0. The 
triangle inequality then gives that | ||X_{ρ_n} − a|| − ||X_τ − a|| | ≤ ||X_{ρ_n} − X_τ||, 
so lim_{n→∞} ||X_{ρ_n} − a|| = ||X_τ − a||. Hence, lim_{n→∞} ||X_{ρ_n} − a||² = ||X_τ − a||², 
i.e. lim_{n→∞} Var(X_{ρ_n}) = Var(X_τ). ∎ 


Lemma 14.1.16. If {Y_n} → Y in L^p, i.e. lim_{n→∞} E(|Y_n − Y|^p) = 0, 
where 1 ≤ p < ∞, then there is a subsequence {Y_{n_k}} with Y_{n_k} → Y a.s. In 
particular, if {Y_n} → Y in L^p, and also {Y_n} → Z a.s., then Y = Z a.s. 


Proof (optional). Let {Y_{n_k}} be any subsequence such that 
E(|Y_{n_k} − Y|^p) ≤ 4^{−k}, and let A_k = {ω ∈ Ω : |Y_{n_k}(ω) − Y(ω)| ≥ 2^{−k/p}}. Then 

$$ 4^{-k} \ge E(|Y_{n_k} - Y|^p) \ge E(|Y_{n_k} - Y|^p\,1_{A_k}) \ge (2^{-k/p})^p\,P(A_k) = 2^{-k}\,P(A_k). $$

Hence, P(A_k) ≤ 2^{−k}, so Σ_k P(A_k) ≤ 1 < ∞, so by the Borel-Cantelli 
Lemma, P(limsup_k A_k) = 0. For any ε > 0, since {ω ∈ Ω : |Y_{n_k}(ω) − 
Y(ω)| ≥ ε} ⊆ A_k for all k ≥ p log_2(1/ε), this shows that P(|Y_{n_k}(ω) − 
Y(ω)| ≥ ε i.o.) ≤ P(limsup_k A_k) = 0. Lemma 5.2.1 then implies that 
{Y_{n_k}} → Y a.s. ∎ 


14.2. Martingale convergence. 


If {Xn} is a martingale (or a submartingale), will it converge almost 
surely to some (random) value? 

Of course, the answer to this question in general is no. Indeed, simple 
symmetric random walk is a martingale, but it is recurrent, so it will forever 
oscillate between all the integers, without ever settling down anywhere. 

On the other hand, consider the Markov chain on the non-negative 
integers with (say) X_0 = 50, and with transition probabilities given by 
p_ij = 1/(2i+1) for 0 ≤ j ≤ 2i, with p_ij = 0 otherwise. (That is, if X_n = i, 
then X_{n+1} is uniformly distributed on the 2i + 1 points 0, 1, ..., 2i.) This 
Markov chain is a martingale. Clearly if it converges at all it must converge 
to 0; but does it? 

This question is answered by: 


Theorem 14.2.1. (Martingale Convergence Theorem) Let {X_n} be 
a submartingale. Suppose that sup_n E|X_n| < ∞. Then there is a (finite) 
random variable X such that X_n → X almost surely (a.s.). 


We shall prove Theorem 14.2.1 below. We first note the following 
immediate corollary. 


Corollary 14.2.2. Let {X_n} be a martingale which is non-negative (or 
more generally satisfies that either X_n ≥ C for all n, or X_n ≤ C for all 
n, for some C ∈ R). Then there is a (finite) random variable X such that 
X_n → X a.s. 


Proof. If {X_n} is a non-negative martingale, then E|X_n| = E(X_n) = 
E(X_0), so the result follows directly from Theorem 14.2.1. 

If X_n ≥ C [respectively X_n ≤ C], then the result follows since {X_n − C} 
[respectively {−X_n + C}] is a non-negative martingale. ∎ 


It follows from this corollary that the above Markov chain example does 
indeed converge to 0 a.s. 
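
A short simulation makes this convergence visible. The Python sketch below is not from the original text. It first checks the martingale condition (14.0.2) exactly for the chain that jumps uniformly on {0, 1, ..., 2i}, then simulates paths from X_0 = 50; the function names, the seed, the number of paths and the number of steps are all illustrative assumptions.

```python
import random
from fractions import Fraction

def mean_next(i):
    """Exact sum_j j * p_ij when the next state is uniform on {0, 1, ..., 2i}."""
    return sum(Fraction(j, 2 * i + 1) for j in range(2 * i + 1))

def run_chain(rng, start=50, steps=200):
    """Simulate `steps` transitions of the chain and return the final state."""
    x = start
    for _ in range(steps):
        x = rng.randint(0, 2 * x)    # uniform on the 2x + 1 points 0, ..., 2x
    return x

rng = random.Random(1)
finals = [run_chain(rng) for _ in range(500)]
absorbed = sum(f == 0 for f in finals) / len(finals)
```

State 0 is absorbing, and nearly all simulated paths have reached it after 200 steps. Since E(X_n) stays equal to 50 for every n, the rare paths not yet absorbed must carry very large values.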

For a second example, suppose X_0 = 50 and that p_ij = 1/(2 min(i, 100−i) + 1) 
for |j − i| ≤ min(i, 100 − i). This Markov chain lives on {0, 1, 2, ..., 100}, at 
each stage jumping uniformly to one of the points within min(i, 100 − i) of 
its current position. By Corollary 14.2.2, {X_n} → X a.s. for some random 
variable X, and indeed it is easily seen that P(X = 0) = P(X = 100) = 1/2. 

For a third example, let S = {2^n; n ∈ Z}, with a Markov chain on S having 
transition probabilities p_{i,2i} = 1/3, p_{i,i/2} = 2/3. This is again a martingale, 
and in fact it converges to 0 a.s. (even though it is unbounded). 

For a fourth example, let S = {0, 1, 2, ...} be the set of non-negative 
integers, with X_0 = 1. Let p_ii = 1 for i even, with p_{i,i−1} = 2/3 and 
p_{i,i+2} = 1/3 for i odd. Then it is easy to see that this Markov chain is a 
non-negative martingale, which converges a.s. to a random variable X having 
the property that P(X = i) > 0 for every even non-negative integer i. 


It remains to prove Theorem 14.2.1. We require the following lemma. 


Lemma 14.2.3. (Upcrossing Lemma) Let {X_n} be a submartingale. 
For M ∈ N and α < β, let 

$$ U_M^{\alpha,\beta} = \sup\{k;\ \exists\, a_1 < b_1 < a_2 < b_2 < a_3 < \ldots < a_k < b_k \le M;\ X_{a_i} \le \alpha,\ X_{b_i} \ge \beta\}. $$

(Intuitively, U_M^{α,β} represents the number of times the process {X_n} 
“upcrosses” the interval [α, β] by time M.) Then 

$$ E(U_M^{\alpha,\beta}) \le E|X_M - X_0| \,/\, (\beta - \alpha). $$


Proof. By Exercise 14.4.2, the sequence {max(X_n, α)} is also a 
submartingale, and it clearly has the same number of upcrossings of [α, β] as 
does {X_n}. Furthermore, |max(X_M, α) − max(X_0, α)| ≤ |X_M − X_0|. Hence, 
for the remainder of the proof we can (and do) replace X_n by max(X_n, α), 
i.e. we assume that X_n ≥ α for all n. 

Let u_0 = v_0 = 0, and iteratively define u_j and v_j for j ≥ 1 by 

$$ u_j = \min\big(M, \inf\{k \ge v_{j-1};\ X_k \le \alpha\}\big), $$

$$ v_j = \min\big(M, \inf\{k \ge u_j;\ X_k \ge \beta\}\big). $$

We necessarily have v_M = M, so that 

$$ \begin{aligned}
E(X_M) &= E(X_{v_M}) \\
&= E\big(X_{v_M} - X_{u_M} + X_{u_M} - X_{v_{M-1}} + X_{v_{M-1}} - X_{u_{M-1}} + \ldots + X_{u_1} - X_0 + X_0\big) \\
&= E(X_0) + E\Big(\sum_{k=1}^{M} (X_{v_k} - X_{u_k})\Big) + \sum_{k=1}^{M} E(X_{u_k} - X_{v_{k-1}}).
\end{aligned} $$

Now, E(X_{u_k} − X_{v_{k−1}}) ≥ 0 by Theorem 14.1.1, since {X_n} is a 
submartingale and v_{k−1} ≤ u_k ≤ M. Hence, Σ_{k=1}^M E(X_{u_k} − X_{v_{k−1}}) ≥ 0. (This 
is the subtlest part of the proof; since usually X_{u_k} ≤ α and X_{v_{k−1}} ≥ β, it 
is surprising that E(X_{u_k} − X_{v_{k−1}}) ≥ 0.) 

Furthermore, we claim that 

$$ E\Big(\sum_{k=1}^{M} (X_{v_k} - X_{u_k})\Big) \ge E\big((\beta - \alpha)\,U_M^{\alpha,\beta}\big). \tag{14.2.4} $$

Indeed, each upcrossing contributes at least β − α to the sum. And, any “null 
cycle” where u_k = v_k = M contributes nothing, since then X_{v_k} − X_{u_k} = 
X_M − X_M = 0. Finally, we may have one “incomplete cycle” where u_k < M 
but v_k = M. However, since we are assuming that X_n ≥ α, we must have 
X_{v_k} − X_{u_k} = X_M − X_{u_k} ≥ α − α = 0 in this case; that is, such an incomplete 
cycle could only increase the sum. This establishes (14.2.4). 

We conclude that E(X_M) ≥ E(X_0) + (β − α) E(U_M^{α,β}). Re-arranging, 
(β − α) E(U_M^{α,β}) ≤ E(X_M) − E(X_0) ≤ E|X_M − X_0|, which gives the result. ∎ 


Proof of Theorem 14.2.1. Let K = sup_n E(|X_n|) < ∞. Note first that 
by Fatou’s lemma, 

$$ E\big(\liminf_n |X_n|\big) \le \liminf_n E|X_n| \le K < \infty. $$

It follows that P(|X_n| → ∞) = 0, i.e. {X_n} will not diverge to ±∞. 

Suppose now that P(lim inf X_n < lim sup X_n) > 0. Then since 

$$ \{\liminf X_n < \limsup X_n\} = \bigcup_{q \in \mathbf{Q}} \bigcup_{k \in \mathbf{N}} \Big\{\liminf X_n < q < q + \frac1k < \limsup X_n\Big\}, $$

it follows from countable subadditivity that we can find α, β ∈ ℚ with 

$$ P(\liminf X_n < \alpha < \beta < \limsup X_n) > 0. $$

With U_M^{α,β} as in Lemma 14.2.3, this implies that P(lim_{M→∞} U_M^{α,β} = ∞) > 0, 
so E(lim_{M→∞} U_M^{α,β}) = ∞. Then by the monotone convergence 
theorem, lim_{M→∞} E(U_M^{α,β}) = E(lim_{M→∞} U_M^{α,β}) = ∞. But Lemma 14.2.3 
says that for all M ∈ N, E(U_M^{α,β}) ≤ E|X_M − X_0| / (β − α) ≤ 2K / (β − α), 
a contradiction. 

We conclude that P(lim_{n→∞} |X_n| = ∞) = 0 and also P(lim inf X_n < 
lim sup X_n) = 0. Hence, we must have P(lim_{n→∞} X_n exists and is finite) = 
1, as claimed. ∎ 


14.3. Maximal inequality. 


Markov’s inequality says that for a > 0, P(Xo > a) < P(|Xo| > a) < 
E|Xo| /a. Surprisingly, for a submartingale, the same inequality holds with 
Xo replaced by maxo<i<n Xi, even though usually (maxo<i<n Xi) > Xo. 


Theorem 14.3.1. (Martingale maximal inequality) If {Xn} is a sub- 
martingale, then for all a > 0, 


E|X 
P [( gx xi) > | < ElXal 
0<i<n a 


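Proof. Let $A_k$ be the event $\{X_k \ge \alpha, \text{ but } X_i < \alpha \text{ for } i < k\}$, i.e. the
event that the process first reaches $\alpha$ at time $k$. And let $A = \dot{\bigcup}_{0 \le k \le n} A_k$
be the event that the process reaches $\alpha$ by time $n$.
We need to show that $P(A) \le E|X_n| / \alpha$, i.e. that $\alpha P(A) \le E|X_n|$. To
that end, we compute (letting $\mathcal{F}_k = \sigma(X_0, X_1, \ldots, X_k)$) that

$$\begin{aligned}
\alpha P(A) &= \sum_{k=0}^{n} \alpha P(A_k) && \text{since } \{A_k\} \text{ disjoint} \\
&= \sum_{k=0}^{n} E(\alpha \mathbf{1}_{A_k}) \\
&\le \sum_{k=0}^{n} E(X_k \mathbf{1}_{A_k}) && \text{since } X_k \ge \alpha \text{ on } A_k \\
&\le \sum_{k=0}^{n} E\big(E(X_n \mid \mathcal{F}_k)\, \mathbf{1}_{A_k}\big) && \text{by (14.0.5)} \\
&= \sum_{k=0}^{n} E(X_n \mathbf{1}_{A_k}) && \text{since } A_k \in \mathcal{F}_k \\
&= E(X_n \mathbf{1}_A) && \text{since } \{A_k\} \text{ disjoint} \\
&\le E(|X_n| \mathbf{1}_A) \\
&\le E|X_n|,
\end{aligned}$$

as required. ∎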
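
As a sanity check of the maximal inequality, the following minimal sketch enumerates all paths of simple symmetric random walk started at 0 (a martingale, hence a submartingale) and computes both sides exactly. The function name `maximal_inequality_check` and the values of `n` and `alpha` are our own illustrative choices, not part of the text.

```python
from fractions import Fraction
from itertools import product


def maximal_inequality_check(n, alpha):
    """Enumerate all 2**n equally likely paths of simple symmetric random walk
    started at X_0 = 0, and return the pair
    (P[max_{0<=i<=n} X_i >= alpha], E|X_n| / alpha) as exact fractions."""
    weight = Fraction(1, 2 ** n)      # each sequence of +/-1 steps is equally likely
    prob_max = Fraction(0)
    mean_abs = Fraction(0)
    for steps in product((-1, 1), repeat=n):
        x, running_max = 0, 0         # X_0 = 0 is included in the maximum
        for z in steps:
            x += z
            running_max = max(running_max, x)
        if running_max >= alpha:
            prob_max += weight
        mean_abs += weight * abs(x)
    return prob_max, mean_abs / alpha


lhs, rhs = maximal_inequality_check(n=10, alpha=3)
print(lhs, rhs, lhs <= rhs)
```

Exact rational arithmetic is used so that the comparison of the two sides is not affected by rounding.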


Theorem 14.3.1 is clearly false if $\{X_n\}$ is not a submartingale. For
example, if the $X_i$ are i.i.d. equal to 0 or 2 each with probability $\frac{1}{2}$, and if
$\alpha = 2$, then $P[(\max_{0 \le i \le n} X_i) \ge \alpha] = 1 - (\frac{1}{2})^{n+1}$, which for $n \ge 1$ is not
$\le E|X_n| / \alpha = \frac{1}{2}$.
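
The counterexample can be checked by brute force. The sketch below (with a hypothetical helper `iid_counterexample`) enumerates all outcomes of $X_0, \ldots, X_n$ and returns both sides of the would-be inequality as exact fractions.

```python
from fractions import Fraction
from itertools import product


def iid_counterexample(n, alpha=2):
    """For X_0, ..., X_n i.i.d. and equal to 0 or 2 with probability 1/2 each,
    return (P[max_i X_i >= alpha], E|X_n| / alpha) by exact enumeration."""
    weight = Fraction(1, 2 ** (n + 1))            # n + 1 independent fair choices
    prob_max = sum(weight for xs in product((0, 2), repeat=n + 1) if max(xs) >= alpha)
    mean_abs = Fraction(0 + 2, 2)                 # E|X_n| = (0 + 2) / 2 = 1
    return prob_max, mean_abs / alpha


for n in range(0, 5):
    lhs, rhs = iid_counterexample(n)
    print(n, lhs, rhs, lhs <= rhs)
```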

If $\{X_n\}$ is in fact a non-negative martingale, then $E|X_n| = E(X_n) =
E(X_0)$, so the bound in Theorem 14.3.1 does not depend on $n$. Hence,
letting $n \to \infty$ and using continuity of probabilities, we obtain:


Corollary 14.3.2. If $\{X_n\}$ is a non-negative martingale, then for all
$\alpha > 0$,


$$P\Big[\big(\sup_{0 \le i < \infty} X_i\big) \ge \alpha\Big] \le \frac{E(X_0)}{\alpha}.$$


For example, consider the third Markov chain example above, where
$S = \{2^n; n \in \mathbf{Z}\}$ and $p_{i,2i} = \frac{1}{3}$, $p_{i,i/2} = \frac{2}{3}$. If, say, $X_0 = 1$, then we obtain
that $P[(\sup_{0 \le i < \infty} X_i) \ge 2] \le \frac{1}{2}$, which is perhaps surprising. (This result
also follows from applying (7.2.7) to the simple non-symmetric random walk
$\{\log_2 X_n\}$.)


Remark 14.3.3. (Martingale Central Limit Theorem) If $X_n = Z_0 +
\ldots + Z_n$, where $\{Z_i\}$ are i.i.d. with mean 0 and variance 1, then we already
know from the classical Central Limit Theorem that $\frac{X_n}{\sqrt{n}} \Rightarrow N(0,1)$, i.e.
that the properly normalised $X_n$ converge weakly to a standard normal
distribution. In fact, a similar result is true for more general martingales
$\{X_n\}$. Let $\mathcal{F}_n = \sigma(X_0, X_1, \ldots, X_n)$. Set $\sigma_0^2 = \mathrm{Var}(X_0)$, and for $n \ge 1$ set


$$\sigma_n^2 = \mathrm{Var}(X_n \mid \mathcal{F}_{n-1}) = E(X_n^2 - X_{n-1}^2 \mid \mathcal{F}_{n-1}).$$


Then it follows by induction that $\mathrm{Var}(X_n) = \sum_{k=0}^{n} E(\sigma_k^2)$, and of course
$E(X_n) = E(X_0)$ does not grow with $n$. Hence, if we set $v_t = \min\{n \ge
0; \sum_{k=0}^{n} \sigma_k^2 \ge t\}$, then for large $t$, the random variable $X_{v_t}/\sqrt{t}$ has mean
close to 0 and variance close to 1. We then have


$$\frac{X_{v_t}}{\sqrt{t}} \Rightarrow N(0,1) \quad \text{as } t \to \infty$$


under certain conditions, for example if $\sum_n \sigma_n^2 = \infty$ with probability one
and the differences $X_n - X_{n-1}$ are uniformly bounded. For a proof see e.g.
Billingsley (1995, Theorem 35.11).


14.4. Exercises. 


Exercise 14.4.1. Let $\{Z_i\}$ be i.i.d. with $P(Z_i = 1) = P(Z_i = -1) = 1/2$.
Let $X_0 = 0$, $X_1 = Z_1$, and for $n \ge 2$, $X_n = X_{n-1} + (1 + Z_1 + \ldots + Z_{n-1})(2 Z_n - 1)$.
(Intuitively, this corresponds to wagering, at each time $n$, one
dollar more than the number of previous victories.)

(a) Prove that {Xn} is a martingale. 

(b) Prove that {Xn} is not a Markov chain. 


Exercise 14.4.2. Let $\{X_n\}$ be a submartingale, and let $a \in \mathbf{R}$. Let
$Y_n = \max(X_n, a)$. Prove that $\{Y_n\}$ is also a submartingale. (Hint: Use
Exercise 4.5.2.)


Exercise 14.4.3. Let $\{X_n\}$ be a non-negative submartingale. Let
$Y_n = (X_n)^2$. Assuming $E(Y_n) < \infty$ for all $n$, prove that $\{Y_n\}$ is also a
submartingale.


Exercise 14.4.4. The conditional Jensen’s inequality states that if $\phi$ is
a convex function, then $E(\phi(X) \mid \mathcal{G}) \ge \phi(E(X \mid \mathcal{G}))$.

(a) Assuming this, prove that if $\{X_n\}$ is a submartingale, then so is $\{\phi(X_n)\}$
whenever $\phi$ is non-decreasing and convex with $E|\phi(X_n)| < \infty$ for all $n$.

(b) Show that the conclusions of the two previous exercises follow from 
part (a). 


Exercise 14.4.5. Let $Z$ be a random variable on a probability triple
$(\Omega, \mathcal{F}, P)$, and let $\mathcal{G}_0 \subseteq \mathcal{G}_1 \subseteq \ldots \subseteq \mathcal{F}$ be a nested sequence of sub-$\sigma$-algebras.
Let $X_n = E(Z \mid \mathcal{G}_n)$. (If we think of $\mathcal{G}_n$ as the amount of information we have
available at time $n$, then $X_n$ represents our best guess of the value $Z$ at time
$n$.) Prove that $X_0, X_1, \ldots$ is a martingale. [Hint: Use Proposition 13.2.7 to
show that $E(X_{n+1} \mid \mathcal{G}_n) = X_n$. Then use the fact that $X_i$ is $\mathcal{G}_i$-measurable
to prove that $\sigma(X_0, \ldots, X_n) \subseteq \mathcal{G}_n$.]


Exercise 14.4.6. Let $\{X_n\}$ be a stochastic process, let $\tau$ and $\rho$ be two
non-negative-integer-valued random variables, and let $m \in \mathbf{N}$.

(a) Prove that $\tau$ is a stopping time for $\{X_n\}$ if and only if $\{\tau \le n\} \in
\sigma(X_0, \ldots, X_n)$ for all $n \ge 0$.
(b) Prove that if $\tau$ is a stopping time, then so is $\min(\tau, m)$.
(c) Prove that if $\tau$ and $\rho$ are stopping times for $\{X_n\}$, then so is $\min(\tau, \rho)$.


Exercise 14.4.7. Let $C \in \mathbf{R}$, and let $\{Z_i\}$ be an i.i.d. collection of random
variables with $P[Z_i = -1] = 3/4$ and $P[Z_i = C] = 1/4$. Let $X_0 = 5$, and
$X_n = 5 + Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$.

(a) Find a value of $C$ such that $\{X_n\}$ is a martingale.

(b) For this value of $C$, prove or disprove that there is a random variable
$X$ such that as $n \to \infty$, $X_n \to X$ with probability 1.

(c) For this value of $C$, prove or disprove that $P[X_n = 0 \text{ for some } n \in
\mathbf{N}] = 1$.


Exercise 14.4.8. Let $\{X_n\}$ be simple symmetric random walk, with
$X_0 = 0$. Let $\tau = \inf\{n \ge 5 : X_{n+1} = X_n + 1\}$ be the first time after 4 which
is just before the chain increases. Let $\rho = \tau + 1$.

(a) Is $\tau$ a stopping time? Is $\rho$ a stopping time?

(b) Use Theorem 14.1.5 to compute $E(X_\rho)$.

(c) Use the result of part (b) to compute $E(X_\tau)$. Why does this not
contradict Theorem 14.1.5?


Exercise 14.4.9. Let $0 < a < c$ be integers, with $c \ge 3$. Let $\{X_n\}$ be
simple symmetric random walk with $X_0 = a$, let $\sigma = \inf\{n \ge 1 : X_n =
0 \text{ or } c\}$, and let $\rho = \sigma - 1$. Determine whether or not $E(X_\sigma) = a$, and
whether or not $E(X_\rho) = a$. Relate the results to Corollary 14.1.7.


Exercise 14.4.10. Let $0 < a < c$ be integers. Let $\{X_n\}$ be simple
symmetric random walk, started at $X_0 = a$. Let $\tau = \inf\{n \ge 1; X_n = 0 \text{ or }
c\}$.

(a) Prove that $\{X_n\}$ is a martingale.

(b) Prove that $E(X_\tau) = a$. [Hint: Use Corollary 14.1.7.]

(c) Use this fact to derive an alternative proof of the gambler’s ruin formula
given in Section 7.2, for the case $p = 1/2$.


Exercise 14.4.11. Let $0 < p < 1$ with $p \ne 1/2$, and let $0 < a < c$ be
integers. Let $\{X_n\}$ be simple random walk with parameter $p$, started at
$X_0 = a$. Let $\tau = \inf\{n \ge 1; X_n = 0 \text{ or } c\}$. Let $Z_n = ((1-p)/p)^{X_n}$ for
$n = 0, 1, 2, \ldots$.

(a) Prove that $\{Z_n\}$ is a martingale.

(b) Prove that $E(Z_\tau) = ((1-p)/p)^a$. [Hint: Use Corollary 14.1.7.]

(c) Use this fact to derive an alternative proof of the gambler’s ruin formula
given in Section 7.2, for the case $p \ne 1/2$.


Exercise 14.4.12. Let $\{S_n\}$ and $\tau$ be as in Example 14.1.13.
(a) Prove that $E(\tau) < \infty$. [Hint: Show that $P(\tau > 3m) \le (7/8)^m$, and
use Proposition 4.2.9.]
(b) Prove that $S_\tau = -\tau + 10$. [Hint: By considering the $\tau$ different players
one at a time, argue that $S_\tau = (\tau - 3)(-1) + 7 - 1 + 1$.]


Exercise 14.4.13. Similar to Example 14.1.13, let $\{r_n\}_{n \ge 1}$ be infinite
fair coin tossing, $\sigma = \inf\{n \ge 3 : r_{n-2} = 0, r_{n-1} = 1, r_n = 1\}$, and
$\rho = \inf\{n \ge 4 : r_{n-3} = 0, r_{n-2} = 1, r_{n-1} = 0, r_n = 1\}$.

(a) Describe $\sigma$ and $\rho$ in plain English.

(b) Compute $E(\sigma)$.

(c) Does $E(\sigma) = E(\tau)$, with $\tau$ as in Example 14.1.13? Why or why not?
(d) Compute $E(\rho)$.


Exercise 14.4.14. Why does the proof of Theorem 14.1.1 fail if $M = \infty$?
[Hint: Exercise 4.5.14 may help.]


Exercise 14.4.15. Modify the proof of Theorem 14.1.1 to show that
if $\{X_n\}$ is a submartingale with $|X_{n+1} - X_n| \le M < \infty$, and $\tau_1 \le \tau_2$
are stopping times (not necessarily bounded) with $E(\tau_2) < \infty$, then we
still have $E(X_{\tau_2}) \ge E(X_{\tau_1})$. [Hint: Corollary 9.4.4 may help.] (Compare
Corollary 14.1.11.)


Exercise 14.4.16. Let $\{X_n\}$ be simple symmetric random walk, with
$X_0 = 10$. Let $\tau = \min\{n \ge 1; X_n = 0\}$, and let $Y_n = X_{\min(n,\tau)}$. Determine
(with explanation) whether each of the following statements is true or false.
(a) $E(X_{200}) = 10$.

(b) $E(Y_{200}) = 10$.

(c) $E(X_\tau) = 10$.

(d) $E(Y_\tau) = 10$.

(e) There is a random variable $X$ such that $\{X_n\} \to X$ a.s.

(f) There is a random variable $Y$ such that $\{Y_n\} \to Y$ a.s.


Exercise 14.4.17. Let $\{X_n\}$ be simple symmetric random walk with
$X_0 = 0$, and let $\tau = \inf\{n \ge 0; X_n = -5\}$.

(a) What is $E(X_\tau)$ in this case?

(b) Why does this fact not contradict Wald’s theorem part (a) (with $a =
m = 0$)?


Exercise 14.4.18. Let $0 < p < 1$ with $p \ne 1/2$, and let $0 < a < c$ be
integers. Let $\{X_n\}$ be simple random walk with parameter $p$, started at
$X_0 = a$. Let $\tau = \inf\{n \ge 1; X_n = 0 \text{ or } c\}$. (Thus, from the gambler’s ruin
solution of Subsection 7.2, $P(X_\tau = c) = [((1-p)/p)^a - 1] \,/\, [((1-p)/p)^c - 1]$.)
(a) Compute $E(X_\tau)$ by direct computation.

(b) Use Wald’s theorem part (a) to compute $E(\tau)$ in terms of $E(X_\tau)$.
(c) Prove that the game’s expected duration satisfies


$$E(\tau) = \Big(a - c\,\frac{((1-p)/p)^a - 1}{((1-p)/p)^c - 1}\Big) \Big/ (1 - 2p).$$


(d) Show that the limit of $E(\tau)$ as $p \to 1/2$ is equal to $a(c-a)$.
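
The formula in part (c) can be explored numerically before it is proved. The following sketch is only a plausibility check under our own assumptions: it verifies, in exact rational arithmetic, that the formula satisfies the standard one-step recursion $m(i) = 1 + p\,m(i+1) + (1-p)\,m(i-1)$ with $m(0) = m(c) = 0$ (this recursion is not stated in the exercise), and then evaluates the formula for $p$ close to $1/2$. The function name `expected_duration` and the parameter values are hypothetical.

```python
from fractions import Fraction


def expected_duration(a, c, p):
    """Formula from part (c): E(tau) for simple random walk with parameter p != 1/2,
    started at a and stopped on hitting 0 or c."""
    r = (1 - p) / p
    hit_c = (r ** a - 1) / (r ** c - 1)   # gambler's ruin probability P(X_tau = c)
    return (a - c * hit_c) / (1 - 2 * p)


# Exact check of the one-step recursion m(i) = 1 + p*m(i+1) + (1-p)*m(i-1),
# with boundary values m(0) = m(c) = 0.
p, c = Fraction(1, 3), 10
m = [expected_duration(i, c, p) for i in range(c + 1)]
assert m[0] == 0 and m[c] == 0
assert all(m[i] == 1 + p * m[i + 1] + (1 - p) * m[i - 1] for i in range(1, c))

# Numerical look at the limit p -> 1/2, to be compared with a(c - a) = 21 for a = 3, c = 10.
for p_float in (0.4, 0.49, 0.499):
    print(p_float, expected_duration(3, 10, p_float))
```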


Exercise 14.4.19. Let $0 < a < c$ be integers, and let $\{X_n\}$ be simple
symmetric random walk with $X_0 = a$. Let $\tau = \inf\{n \ge 1; X_n = 0 \text{ or } c\}$.
(a) Compute $\mathrm{Var}(X_\tau)$ by direct computation.

(b) Use Wald’s theorem part (b) to compute $E(\tau)$ in terms of $\mathrm{Var}(X_\tau)$.
(c) Prove that the game’s expected duration satisfies $E(\tau) = a(c-a)$.
(d) Relate this result to part (d) of the previous exercise.


Exercise 14.4.20. Let $\{X_n\}$ be a martingale with $|X_{n+1} - X_n| \le 10$ for
all $n$. Let $\tau = \inf\{n \ge 1 : |X_n| \ge 100\}$.

(a) Prove or disprove that this implies that $P(\tau < \infty) = 1$.

(b) Prove or disprove that this implies there is a random variable $X$ with
$\{X_n\} \to X$ a.s.

(c) Prove or disprove that this implies that $P[\tau < \infty$, or there is a random
variable $X$ with $\{X_n\} \to X] = 1$. [Hint: Let $Y_n = X_{\min(\tau,n)}$.]


Exercise 14.4.21. Let $Z_1, Z_2, \ldots$ be independent, with


$$P\Big(Z_i = \frac{2^i}{2^i - 1}\Big) = \frac{2^i - 1}{2^i} \quad \text{and} \quad P(Z_i = -2^i) = \frac{1}{2^i}.$$

Let $X_0 = 0$, and $X_n = Z_1 + \ldots + Z_n$ for $n \ge 1$.

(a) Prove that $\{X_n\}$ is a martingale. [Hint: Don’t forget (14.0.2).]

(b) Prove that $P[Z_i \ge 1 \text{ a.a.}] = 1$, i.e. that with probability 1, $Z_i \ge 1$ for
all but finitely many $i$. [Hint: Don’t forget the Borel-Cantelli Lemma.]

(c) Prove that $P[\lim_{n\to\infty} X_n = \infty] = 1$. (Hence, even though $\{X_n\}$ is a
martingale and thus represents a player’s fortune in a “fair” game, it is still
certain that the player’s fortune will converge to $+\infty$.)

(d) Why does this result not contradict Corollary 14.2.2?

(e) Let $\tau = \inf\{n \ge 1 : X_n \le 0\}$, and let $Y_n = X_{\min(\tau,n)}$. Prove that
$\{Y_n\}$ is also a martingale, and that $P[\lim_{n\to\infty} Y_n = \infty] > 0$. Why does this
result not contradict Corollary 14.2.2?


14.5. Section summary. 


This section provided a brief introduction to martingales, including 
submartingales, supermartingales, and stopping times. Some examples 
were given, mostly arising from Markov chains. Various versions of
the Optional Sampling Theorem were proved, giving conditions such that
$E(X_\tau) = E(X_0)$. This together with the Upcrossing Lemma was then used
to prove the important Martingale Convergence Theorem. Finally, the
Martingale maximal inequality was presented.
15. General stochastic processes. 


We end this text with a brief look at more general stochastic processes. 
We attempt to give an intuitive discussion of this area, without being overly 
careful about mathematical precision. (A full treatment of these processes 
would require another course, perhaps following one of the books in Sub- 
section B.5.) In particular, in this section a number of results are stated 
without being proved, and a number of equations are derived in an intuitive 
and non-rigorous manner. All exercises are grouped by subsection, rather 
than at the end of the entire section. 

To begin, we define a (completely general) stochastic process to be any
collection $\{X_t; t \in T\}$ of random variables defined jointly on some
probability triple. Here $T$ can be any non-empty index set. If $T = \{0, 1, 2, \ldots\}$ then
this corresponds to our usual discrete-time stochastic processes; if $T = \mathbf{R}^{\ge 0}$
is the non-negative real numbers, then this corresponds to a continuous-time
stochastic process as discussed below.


15.1. Kolmogorov Existence Theorem. 


As with all mathematical objects, a proper analysis of stochastic pro- 
cesses should begin with a proof that they exist. In two places in this text 
(Theorems 7.1.1 and 8.1.1), we have proved the existence of certain random 
variables, defined on certain underlying probability triples, having certain 
specified properties. The Kolmogorov Existence Theorem is a huge general- 
isation of these results, which allows us to define stochastic processes quite 
generally, as we now discuss. 

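Given a stochastic process $\{X_t; t \in T\}$, and $k \in \mathbf{N}$, and a finite
collection $t_1, \ldots, t_k \in T$ of distinct index values, we define the Borel probability
measure $\mu_{t_1 \ldots t_k}$ on $\mathbf{R}^k$ by


$$\mu_{t_1 \ldots t_k}(H) = P\big((X_{t_1}, \ldots, X_{t_k}) \in H\big), \qquad H \subseteq \mathbf{R}^k \text{ Borel}.$$


The distributions $\{\mu_{t_1 \ldots t_k}; k \in \mathbf{N}, t_1, \ldots, t_k \in T \text{ distinct}\}$ are called the
finite-dimensional distributions for the stochastic process $\{X_t; t \in T\}$.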

These finite-dimensional distributions clearly satisfy two sorts of consis- 
tency conditions: 


(C1) If $(s(1), s(2), \ldots, s(k))$ is any permutation of $(1, 2, \ldots, k)$ (meaning
that $s : \{1, 2, \ldots, k\} \to \{1, 2, \ldots, k\}$ is one-to-one), then for distinct
$t_1, \ldots, t_k \in T$, and any Borel $H_1, \ldots, H_k \subseteq \mathbf{R}$, we have


$$\mu_{t_1 \ldots t_k}(H_1 \times \ldots \times H_k) = \mu_{t_{s(1)} \ldots t_{s(k)}}(H_{s(1)} \times \ldots \times H_{s(k)}). \qquad (15.1.1)$$

That is, if we permute the indices $t_i$, and correspondingly modify the
set $H = H_1 \times \ldots \times H_k$, then we do not change the probabilities. For
example, we must have $P(X \in A, Y \in B) = P(Y \in B, X \in A)$, even
though this will not usually equal $P(Y \in A, X \in B)$.


(C2) For distinct $t_1, \ldots, t_k \in T$, and any Borel $H_1, \ldots, H_{k-1} \subseteq \mathbf{R}$, we have

$$\mu_{t_1 \ldots t_k}(H_1 \times \ldots \times H_{k-1} \times \mathbf{R}) = \mu_{t_1 \ldots t_{k-1}}(H_1 \times \ldots \times H_{k-1}). \qquad (15.1.2)$$


That is, allowing $X_{t_k}$ to be anywhere in $\mathbf{R}$ is equivalent to not
mentioning $X_{t_k}$ at all. For example, $P(X \in A, Y \in \mathbf{R}) = P(X \in A)$.
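
To make (C1) and (C2) concrete, here is a small sketch on a toy example of our own: three random variables taking values in $\{0, 1\}$, with a hypothetical joint law chosen only for illustration. The finite set $\{0,1\}$ plays the role of $\mathbf{R}$, and the helper `mu` plays the role of $\mu_{t_1 \ldots t_k}$ evaluated on product sets.

```python
from fractions import Fraction
from itertools import permutations, product

# Hypothetical joint law of (X_0, X_1, X_2) on {0,1}^3, chosen only for illustration:
# the weight of an outcome is proportional to 1 + x0 + 2*x1 + 3*x2.
OUTCOMES = list(product((0, 1), repeat=3))
TOTAL = sum(1 + x0 + 2 * x1 + 3 * x2 for x0, x1, x2 in OUTCOMES)
JOINT = {w: Fraction(1 + w[0] + 2 * w[1] + 3 * w[2], TOTAL) for w in OUTCOMES}


def mu(times, sets):
    """mu_{t_1...t_k}(H_1 x ... x H_k) = P(X_{t_1} in H_1, ..., X_{t_k} in H_k)."""
    return sum(
        (prob for w, prob in JOINT.items()
         if all(w[t] in h for t, h in zip(times, sets))),
        Fraction(0),
    )


times = (0, 1, 2)
sets = ({0}, {0, 1}, {1})

# (C1): permuting the times and the sets in the same way leaves the probability unchanged.
c1_holds = all(
    mu(times, sets) == mu([times[i] for i in s], [sets[i] for i in s])
    for s in permutations(range(3))
)

# (C2): letting the last coordinate range over the whole space is the same as omitting it.
whole_space = {0, 1}
c2_holds = mu(times, (sets[0], sets[1], whole_space)) == mu(times[:2], sets[:2])

print(c1_holds, c2_holds)
```

Both checks succeed automatically here because the finite-dimensional distributions are computed from a single joint law; the content of the Kolmogorov Existence Theorem is the converse direction.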


Conditions (C1) and (C2) are quite obvious and uninteresting. However, 
what is surprising is that they have an immediate converse; that is, for any 
collection of finite-dimensional distributions satisfying them, there exists a 
corresponding stochastic process. A formal statement is: 


Theorem 15.1.3. (Kolmogorov Existence Theorem) A family of Borel
probability measures $\{\mu_{t_1 \ldots t_k}; k \in \mathbf{N}, t_i \in T \text{ distinct}\}$, with $\mu_{t_1 \ldots t_k}$ a
measure on $\mathbf{R}^k$, satisfies the consistency conditions (C1) and (C2) above
if and only if there exists a probability triple $(\mathbf{R}^T, \mathcal{F}^T, P)$, and random
variables $\{X_t\}_{t \in T}$ defined on this triple, such that for all $k \in \mathbf{N}$, distinct
$t_1, \ldots, t_k \in T$, and Borel $H \subseteq \mathbf{R}^k$, we have


$$P\big((X_{t_1}, \ldots, X_{t_k}) \in H\big) = \mu_{t_1 \ldots t_k}(H). \qquad (15.1.4)$$


The theorem thus says that, under extremely general conditions, stochas- 
tic processes exist. Theorems 7.1.1 and 8.1.1 follow immediately as special 
cases (Exercises 15.1.6 and 15.1.7). 

The “only if” direction is immediate, as discussed above. To prove the
“if” direction, we can take


$$\mathbf{R}^T = \{\text{all functions } T \to \mathbf{R}\}$$


and

$$\mathcal{F}^T = \sigma\big\{\{X_t \in H\}; \ t \in T, \ H \subseteq \mathbf{R} \text{ Borel}\big\}.$$


The idea of the proof is to first use (15.1.4) to define $P$ for subsets of the
form $\{(X_{t_1}, \ldots, X_{t_k}) \in H\}$ (for distinct $t_1, \ldots, t_k \in T$, and Borel $H \subseteq \mathbf{R}^k$),
and then extend the definition of $P$ to the entire $\sigma$-algebra $\mathcal{F}^T$, analogously
to the proof of the Extension Theorem (Theorem 2.3.1). The argument is
somewhat involved; for details see e.g. Billingsley (1995, Theorem 36.1).


Exercise 15.1.5. Suppose that (15.1.1) holds whenever $s$ is a
transposition, i.e. a one-to-one function $s : \{1, 2, \ldots, k\} \to \{1, 2, \ldots, k\}$ such that
$s(i) = i$ for $k - 2$ choices of $i \in \{1, 2, \ldots, k\}$. Prove that the first consistency
condition is satisfied, i.e. that (15.1.1) holds for all permutations $s$.


Exercise 15.1.6. Let $X_1, X_2, \ldots$ be independent, with $X_n \sim \mu_n$.

(a) Specify the finite-dimensional distributions $\mu_{t_1, t_2, \ldots, t_k}$ for distinct
non-negative integers $t_1, t_2, \ldots, t_k$.

(b) Prove that these $\mu_{t_1, t_2, \ldots, t_k}$ satisfy (15.1.1) and (15.1.2).

(c) Prove that Theorem 7.1.1 follows from Theorem 15.1.3.


Exercise 15.1.7. Consider the definition of a discrete Markov chain 
{Xn}, from Section 8. 

(a) Specify the finite-dimensional distributions $\mu_{t_1, t_2, \ldots, t_k}$ for non-negative
integers $t_1 < t_2 < \ldots < t_k$.

(b) Prove that they satisfy (15.1.1).

(c) Specify the finite-dimensional distributions $\mu_{t_1, t_2, \ldots, t_k}$ for general
distinct non-negative integers $t_1, t_2, \ldots, t_k$, and explain why they satisfy (15.1.2).
[Hint: It may be helpful to define the order statistics, whereby $t_{(r)}$ is the
$r^{\text{th}}$-largest element of $\{t_1, t_2, \ldots, t_k\}$ for $1 \le r \le k$.]

(d) Prove that Theorem 8.1.1 follows from Theorem 15.1.3. 


15.2. Markov chains on general state spaces. 


In Section 8 we considered Markov chains on countable state spaces
$S$, in terms of an initial distribution $\{\nu_i\}_{i \in S}$ and transition probabilities
$\{p_{ij}\}_{i,j \in S}$. We now generalise many of the notions there to general (perhaps
uncountable) state spaces.

We require a general state space $\mathcal{X}$, which is any non-empty (perhaps
uncountable) set, together with a $\sigma$-algebra $\mathcal{F}$ of measurable subsets. The
transition probabilities are then given by $\{P(x, A)\}_{x \in \mathcal{X}, A \in \mathcal{F}}$. We make the
following two assumptions:

(A1) For each fixed $x \in \mathcal{X}$, $P(x, \cdot)$ is a probability measure on $(\mathcal{X}, \mathcal{F})$.
(A2) For each fixed $A \in \mathcal{F}$, $P(x, A)$ is a measurable function of $x \in \mathcal{X}$.
Intuitively, $P(x, A)$ is the probability, if the chain is at a point $x$, that it
will jump to the subset $A$ at the next step. If $\mathcal{X}$ is countable, then $P(x, \{i\})$
corresponds to the transition probability $p_{xi}$ of the discrete Markov chains
of Section 8. But on a general state space, we may have $P(x, \{i\}) = 0$ for
all $i \in \mathcal{X}$.

We also require an initial distribution $\nu$, which is any probability
distribution on $(\mathcal{X}, \mathcal{F})$. The transition probabilities and initial distribution
then give rise to a (discrete-time, general state space, time-homogeneous)
Markov chain $X_0, X_1, X_2, \ldots$, where


$$P(X_0 \in A_0, X_1 \in A_1, \ldots, X_n \in A_n)
= \int_{x_0 \in A_0} \nu(dx_0) \int_{x_1 \in A_1} P(x_0, dx_1) \cdots
\int_{x_{n-1} \in A_{n-1}} P(x_{n-2}, dx_{n-1}) \, P(x_{n-1}, A_n). \qquad (15.2.1)$$


Note that these integrals (i.e., expected values) are well-defined because of
condition (A2) above.

As before, we shall write $P_x(\cdots)$ for the probability of an event
conditional on $X_0 = x$, i.e. under the assumption that the initial distribution
$\nu$ is a point-mass at the point $x$. And, we define higher-order
transition probabilities inductively by $P^1(x, A) = P(x, A)$, and $P^{n+1}(x, A) =
\int_{\mathcal{X}} P(x, dz) P^n(z, A)$ for $n \ge 1$.

Analogous to the countable state space case, we define a stationary
distribution for a Markov chain to be a probability measure $\pi(\cdot)$ on $(\mathcal{X}, \mathcal{F})$,
such that $\pi(A) = \int_{\mathcal{X}} \pi(dx) P(x, A)$ for all $A \in \mathcal{F}$. (This generalises our
earlier definition $\pi_j = \sum_i \pi_i p_{ij}$.) As in the countably infinite case, Markov
chains on general state spaces may or may not have stationary distributions.


Example 15.2.2. Consider the Markov chain on the real line (i.e. with
$\mathcal{X} = \mathbf{R}$), where $P(x, \cdot) = N(\frac{x}{2}, \frac{3}{4})$ for each $x \in \mathcal{X}$. In words, if $X_n$ is equal
to some real number $x$, then the conditional distribution of $X_{n+1}$ will be
normal, with mean $\frac{x}{2}$ and variance $\frac{3}{4}$. Equivalently, $X_{n+1} = \frac{1}{2} X_n + U_{n+1}$,
where $\{U_n\}$ are i.i.d. with $U_n \sim N(0, \frac{3}{4})$. This example is analysed in
Exercise 15.2.5 below; in particular, it is shown that $\pi(\cdot) = N(0,1)$ is a
stationary distribution for this chain.
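
The stationarity claim can be made plausible with a short computation. The sketch below assumes the standard fact that a sum of independent normal random variables is normal, so that the law of $X_n$ stays within the normal family and only its mean and variance need to be tracked. The helper name `one_step` and the starting point are our own choices.

```python
from fractions import Fraction


def one_step(mean, var):
    """If X_n ~ N(mean, var), then X_{n+1} = X_n / 2 + U_{n+1} with U ~ N(0, 3/4)
    independent of X_n, so X_{n+1} ~ N(mean / 2, var / 4 + 3/4)."""
    return mean / 2, var / 4 + Fraction(3, 4)


# N(0, 1) is left unchanged by one step of the chain.
print(one_step(Fraction(0), Fraction(1)))

# Starting from the point mass at x = 5 (mean 5, variance 0), the law of X_n
# is N(5 / 2**n, 1 - 4**(-n)), which approaches N(0, 1).
law = (Fraction(5), Fraction(0))
for n in range(1, 6):
    law = one_step(*law)
    print(n, law)
```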


For countable state spaces $S$, we defined irreducibility to mean that for
all $i, j \in S$, there is $n \in \mathbf{N}$ with $p^{(n)}_{ij} > 0$. On uncountable state spaces this
definition is of limited use, since we will often (e.g. in the above example)
have $p^{(n)}_{ij} = 0$ for all $i, j \in \mathcal{X}$ and all $n \ge 1$. Instead, we say that a Markov
chain on a general state space $\mathcal{X}$ is $\phi$-irreducible if there is a non-zero,
$\sigma$-finite measure $\psi$ on $(\mathcal{X}, \mathcal{F})$ such that for any $A \in \mathcal{F}$ with $\psi(A) > 0$, we
have $P_x(\tau_A < \infty) > 0$ for all $x \in \mathcal{X}$. (Here $\tau_A = \inf\{n \ge 0; X_n \in A\}$ is the
first hitting time of the subset $A$; thus, $\tau_A < \infty$ is the event that the chain
eventually hits the subset $A$, and $\phi$-irreducibility is the statement that the
chain has positive probability of eventually hitting any subset $A$ of positive
$\psi$ measure.)

Similarly, on countable state spaces $S$ we defined aperiodicity to mean
that for all $i \in S$, $\gcd\{n \ge 1 : p^{(n)}_{ii} > 0\} = 1$, but on uncountable state
spaces we will usually have $p^{(n)}_{ii} = 0$ for all $n$. Instead, we define the period
of a general-state-space Markov chain to be the largest (finite) positive
integer $d$ such that there are non-empty disjoint subsets $\mathcal{X}_1, \ldots, \mathcal{X}_d \subseteq \mathcal{X}$,
with $P(x, \mathcal{X}_{i+1}) = 1$ for all $x \in \mathcal{X}_i$ ($1 \le i \le d - 1$) and $P(x, \mathcal{X}_1) = 1$ for all
$x \in \mathcal{X}_d$. The chain is periodic if its period is greater than 1, otherwise the
chain is aperiodic.

In terms of these definitions, a fundamental theorem about general state 
space Markov chains is the following generalisation of Theorem 8.3.10. For 
a proof see e.g. Meyn and Tweedie (1993). 


Theorem 15.2.3. If a discrete-time Markov chain on a general state space
is $\phi$-irreducible and aperiodic, and furthermore has a stationary distribution
$\pi(\cdot)$, then for $\pi$-almost every $x \in \mathcal{X}$, we have that


$$\lim_{n \to \infty} \sup_{A \in \mathcal{F}} |P^n(x, A) - \pi(A)| = 0.$$


In words, the Markov chain converges to its stationary distribution in the 
total variation distance metric. 
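
For the chain of Example 15.2.2 this convergence can be observed numerically. The sketch below rests on two facts not derived in the text: iterating the chain from $X_0 = x$ gives $P^n(x, \cdot) = N(x/2^n, \, 1 - 4^{-n})$, and for two distributions with densities $f$ and $g$ the quantity $\sup_A |\mu(A) - \nu(A)|$ equals $\frac{1}{2} \int |f - g|$. The integral is approximated by a plain Riemann sum; the function names, grid size and starting point are illustrative choices.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def tv_to_standard_normal(mean, var, half_width=12.0, steps=24000):
    """Approximate sup_A |N(mean, var)(A) - N(0, 1)(A)| = (1/2) * integral |f - g|
    by a midpoint Riemann sum over [-half_width, half_width]."""
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * h
        total += abs(normal_pdf(x, mean, var) - normal_pdf(x, 0.0, 1.0))
    return 0.5 * total * h


def tv_after_n_steps(x0, n):
    """Uses P^n(x0, .) = N(x0 / 2**n, 1 - 4**(-n)) for the chain of Example 15.2.2."""
    return tv_to_standard_normal(x0 / 2 ** n, 1 - 4.0 ** (-n))


for n in (1, 2, 5, 10):
    print(n, tv_after_n_steps(3.0, n))
```

The printed distances should decrease towards 0 as $n$ grows, in line with the theorem.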


Exercise 15.2.4. Consider a Markov chain which is $\phi$-irreducible with
respect to some non-zero $\sigma$-finite measure $\psi$, and which is periodic with
corresponding disjoint subsets $\mathcal{X}_1, \ldots, \mathcal{X}_d$. Let $B = \bigcup_i \mathcal{X}_i$.

(a) Prove that $P(x, B^C) = 0$ for all $x \in B$.

(b) Prove that $\psi(B^C) = 0$.

(c) Prove that $\psi(\mathcal{X}_i) > 0$ for some $i$.


Exercise 15.2.5. Consider the Markov chain of Example 15.2.2. 

(a) Let $\pi(\cdot) = N(0,1)$ be the standard normal distribution. Prove that
$\pi(\cdot)$ is a stationary distribution for this Markov chain.

(b) Prove that this Markov chain is $\phi$-irreducible with respect to $\lambda$, where
$\lambda$ is Lebesgue measure on $\mathbf{R}$.

(c) Prove that this Markov chain is aperiodic. [Hint: Don’t forget Exer- 
cise 15.2.4.] 

(d) Apply Theorem 15.2.3 to this Markov chain, writing your conclusion 
as explicitly as possible. 


Exercise 15.2.6. (a) Prove that a Markov chain on a countable state
space $\mathcal{X}$ is $\phi$-irreducible if and only if there is $j \in \mathcal{X}$ such that $P_i(\tau_j <
\infty) > 0$ for all $i \in \mathcal{X}$.

(b) Give an example of a Markov chain on a countable state space which
is $\phi$-irreducible, but which is not irreducible in the sense of Subsection 8.2.


Exercise 15.2.7. Show that, for a Markov chain on a countable state 
space S, the definition of aperiodicity from this subsection agrees with the 
previous definition from Definition 8.3.4. 
Exercise 15.2.8. Consider a discrete-time Markov chain with state space
$\mathcal{X} = \mathbf{R}$, and with transition probabilities such that $P(x, \cdot)$ is uniform on the
interval $[x - 1, x + 1]$. Determine whether or not this chain is $\phi$-irreducible.


Exercise 15.2.9. Recall that counting measure is the measure $\psi(\cdot)$
defined by $\psi(A) = |A|$, i.e. $\psi(A)$ is the number of elements in the set $A$, with
$\psi(A) = \infty$ if the set $A$ is infinite.

(a) For a Markov chain on a countable state space $\mathcal{X}$, prove that
irreducibility in the sense of Subsection 8.2 is equivalent to $\phi$-irreducibility
with respect to counting measure on $\mathcal{X}$.

(b) Prove that the Markov chain of Exercise 15.2.5 is not $\phi$-irreducible with
respect to counting measure on $\mathbf{R}$. (That is, prove that there is a set $A$,
and $x \in \mathcal{X}$, such that $P_x(\tau_A < \infty) = 0$, even though $\psi(A) > 0$ where $\psi$ is
counting measure.)

(c) Prove that counting measure on $\mathbf{R}$ is not $\sigma$-finite (cf. Remark 4.4.3).


Exercise 15.2.10. Consider the Markov chain with $\mathcal{X} = \mathbf{R}$, and with
$P(x, \cdot) = N(x, 1)$ for each $x \in \mathcal{X}$.

(a) Prove that this chain is $\phi$-irreducible and aperiodic.

(b) Prove that this chain does not have a stationary distribution. Relate
this to Theorem 15.2.3.


Exercise 15.2.11. Let $\mathcal{X} = \{1, 2, \ldots\}$. Let $P(1, \{1\}) = 1$, and for $x \ge 2$,
$P(x, \{1\}) = 1/x^2$ and $P(x, \{x + 1\}) = 1 - (1/x^2)$.

(a) Prove that this chain has stationary distribution $\pi(\cdot) = \delta_1(\cdot)$.

(b) Prove that this chain is $\phi$-irreducible and aperiodic.

(c) Prove that if $X_0 = x \ge 2$, then $P[X_n = x + n \text{ for all } n] > 0$, and
$\|P^n(x, \cdot) - \pi(\cdot)\| \not\to 0$. Relate this to Theorem 15.2.3.


Exercise 15.2.12. Show that the finite-dimensional distributions im- 
plied by (15.2.1) satisfy the two consistency conditions of the Kolmogorov 
Existence Theorem. What does this allow us to conclude? 


15.3. Continuous-time Markov processes. 


In general, a continuous-time stochastic process is a collection $\{X_t\}_{t \ge 0}$
of random variables, defined jointly on some probability triple, taking values
in some state space $\mathcal{X}$ with $\sigma$-algebra $\mathcal{F}$, and indexed by the non-negative
real numbers $T = \{t \ge 0\}$. Usually we think of the variable $t$ as representing
(continuous) time, so that $X_t$ is the (random) state at some time $t \ge 0$.
Such a process is a (continuous-time, time-homogeneous) Markov process
if there are transition probabilities $P^t(x, \cdot)$ for all $t \ge 0$ and all $x \in \mathcal{X}$, and
an initial distribution $\nu$, such that


$$P(X_0 \in A_0, X_{t_1} \in A_1, \ldots, X_{t_n} \in A_n)
= \int_{x_0 \in A_0} \int_{x_{t_1} \in A_1} \cdots \int_{x_{t_n} \in A_n} \nu(dx_0) \,
P^{t_1}(x_0, dx_{t_1}) \, P^{t_2 - t_1}(x_{t_1}, dx_{t_2}) \cdots P^{t_n - t_{n-1}}(x_{t_{n-1}}, dx_{t_n}) \qquad (15.3.1)$$


for all times $0 \le t_1 < \ldots < t_n$ and all subsets $A_1, \ldots, A_n \in \mathcal{F}$. Letting
$P^0(x, \cdot)$ be a point-mass at $x$, it then follows that


$$P^{s+t}(x, A) = \int P^s(x, dy) \, P^t(y, A), \qquad s, t \ge 0, \ x \in \mathcal{X}, \ A \in \mathcal{F}. \qquad (15.3.2)$$


(This is the semigroup property of the transition probabilities: $P^{s+t} =
P^s P^t$.) On a countable state space $\mathcal{X}$, we sometimes write $p^t_{ij}$ for $P^t(i, \{j\})$;
(15.3.2) can then be written as $p^{s+t}_{ij} = \sum_{k \in \mathcal{X}} p^s_{ik} p^t_{kj}$. As in (8.0.5), $p^0_{ij} = \delta_{ij}$
(i.e., at time $t = 0$, $p^0_{ij}$ equals 1 for $i = j$ and 0 otherwise).


Exercise 15.3.3. Let $\{P^t(x, \cdot)\}$ be a collection of Markov transition
probabilities satisfying (15.3.2). Define finite-dimensional distributions $\mu_{t_1 \ldots t_k}$
for $t_1, \ldots, t_k \ge 0$ by $\mu_{t_1 \ldots t_k}(A_1 \times \ldots \times A_k) = P(X_{t_1} \in A_1, \ldots, X_{t_k} \in A_k)$
as implied by (15.3.1).

(a) Prove that the Kolmogorov consistency conditions are satisfied by
$\{\mu_{t_1 \ldots t_k}\}$. [Hint: You will need to use (15.3.2).]

(b) Apply the Kolmogorov Existence Theorem to $\{\mu_{t_1 \ldots t_k}\}$, and describe
precisely the conclusion.


Another important concept for continuous-time Markov processes is the
generator. If the state space $\mathcal{X}$ is countable (discrete), then the generator
is a matrix $Q = (q_{ij})$, defined for $i, j \in \mathcal{X}$ by


$$q_{ij} = \lim_{t \searrow 0} \frac{p^t_{ij} - \delta_{ij}}{t}. \qquad (15.3.4)$$


Since $0 \le p^t_{ij} \le 1$ for all $t \ge 0$ and all $i$ and $j$, we see immediately that
$q_{ii} \le 0$, while $q_{ij} \ge 0$ if $i \ne j$. Also, since $\sum_j (p^t_{ij} - \delta_{ij}) = 1 - 1 = 0$, we
have $\sum_j q_{ij} = 0$ for each $i$. The matrix $Q$ represents a sort of “derivative”
of $P^t$ with respect to $t$. Hence, $p^s_{ij} \approx \delta_{ij} + s q_{ij}$ for small $s > 0$, i.e.
$P^s = I + sQ + o(s)$ as $s \searrow 0$.


Exercise 15.3.5. Let $\{P^t\} = \{p^t_{ij}\}$ be the transition probabilities for
a continuous-time Markov process on a finite state space $\mathcal{X}$, with finite
generator $Q = \{q_{ij}\}$ given by (15.3.4). Show that $P^t = \exp(tQ)$. (Given
a matrix $M$, we define $\exp(M)$ by $\exp(M) = I + M + \frac{1}{2} M^2 + \frac{1}{3!} M^3 + \ldots$,
where $I$ is the identity matrix.) [Hint: $P^t = (P^{t/n})^n$, and as $n \to \infty$, we
have $P^{t/n} = I + (t/n) Q + O(1/n^2)$.]


Exercise 15.3.6. Let $\mathcal{X} = \{1, 2\}$, and let $Q = (q_{ij})$ be the generator of
a continuous-time Markov process on $\mathcal{X}$, with


$$Q = \begin{pmatrix} -3 & 3 \\ 6 & -6 \end{pmatrix}.$$


Compute the corresponding transition probabilities $P^t = (p^t_{ij})$ of the
process, for any $t > 0$. [Hint: Use the previous exercise. It may help to know
that the eigenvalues of $Q$ are 0 and $-9$, with corresponding left eigenvectors
$(2, 1)$ and $(1, -1)$.]


The generator $Q$ may be interpreted as giving “jump rates” for our
Markov processes. If the chain starts at state $i$ and next jumps to state $j$,
then the time before it jumps is exponentially distributed with mean $1/q_{ij}$.
We say that the process jumps from $i$ to $j$ at rate $q_{ij}$. Exercise 15.3.5 shows
that the generator $Q$ contains all the information necessary to completely
reconstruct the transition probabilities $P^t$ of the Markov process.
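
As an illustration of this reconstruction, the sketch below builds $P^t = \exp(tQ)$ from a truncated power series and checks the semigroup property and the small-time approximation $P^s \approx I + sQ$ numerically. The generator used is a hypothetical two-state example of our own, and the helper names and truncation length are illustrative choices.

```python
def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def mat_exp(m, terms=60):
    """exp(M) = I + M + M^2/2! + M^3/3! + ..., truncated after `terms` terms."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]            # current term M^k / k!, starting with I
    for k in range(1, terms):
        term = [[entry / k for entry in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result


def transition_matrix(q, t):
    """P^t = exp(tQ), as in Exercise 15.3.5."""
    return mat_exp([[t * entry for entry in row] for row in q])


# Hypothetical generator on a two-point state space (each row sums to 0).
Q = [[-1.0, 1.0], [2.0, -2.0]]

P_s, P_t, P_st = transition_matrix(Q, 0.3), transition_matrix(Q, 0.5), transition_matrix(Q, 0.8)
print(P_st)
print(mat_mul(P_s, P_t))                          # semigroup property: should agree with P_st

s = 1e-4
P_small = transition_matrix(Q, s)
print([[(P_small[i][j] - (i == j)) / s for j in range(2)] for i in range(2)])  # close to Q
```

A truncated series is adequate here because the entries of $tQ$ are small; for large $t$ or stiff generators a dedicated matrix-exponential routine would be a better choice.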

A probability distribution $\pi(\cdot)$ on $\mathcal{X}$ is a stationary distribution for the
process $\{X_t\}_{t \ge 0}$ if $\int \pi(dx) P^t(x, \cdot) = \pi(\cdot)$ for all $t \ge 0$; if $\mathcal{X}$ is countable we
can write this as $\pi_j = \sum_{i \in \mathcal{X}} \pi_i p^t_{ij}$ for all $j \in \mathcal{X}$ and all $t \ge 0$. Note that it
is no longer sufficient to check this equation for just one particular value of
$t$ (e.g. $t = 1$). Similarly, we say that $\{X_t\}_{t \ge 0}$ is reversible with respect to
$\pi(\cdot)$ if $\pi(dx) P^t(x, dy) = \pi(dy) P^t(y, dx)$ for all $x, y \in \mathcal{X}$ and all $t \ge 0$.


Exercise 15.3.7. Let $\{X_t\}_{t \ge 0}$ be a Markov process. Prove that

(a) if $\{X_t\}_{t \ge 0}$ is reversible with respect to $\pi(\cdot)$, then $\pi(\cdot)$ is a stationary
distribution for $\{X_t\}_{t \ge 0}$.

(b) if for some $T > 0$ we have $\int \pi(dx) P^t(x, \cdot) = \pi(\cdot)$ for all $0 < t < T$,
then $\pi(\cdot)$ is a stationary distribution for $\{X_t\}_{t \ge 0}$.

(c) if for some $T > 0$ we have $\pi(dx) P^t(x, dy) = \pi(dy) P^t(y, dx)$ for all
$x, y \in \mathcal{X}$ and all $0 < t < T$, then $\{X_t\}_{t \ge 0}$ is reversible with respect to $\pi(\cdot)$.


Exercise 15.3.8. For a Markov chain on a finite state space $\mathcal{X}$ with
generator $Q$, prove that $\{\pi_i\}_{i \in \mathcal{X}}$ is a stationary distribution if and only if
$\pi Q = 0$, i.e. if and only if $\sum_{i \in \mathcal{X}} \pi_i q_{ij} = 0$ for all $j \in \mathcal{X}$.


Exercise 15.3.9. Let $\{Z_n\}$ be i.i.d. $\sim \mathrm{Exp}(5)$, so $P[Z_n > z] = e^{-5z}$ for
$z > 0$. Let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{X_t\}_{t \ge 0}$ be a continuous-time
Markov process on the state space $\mathcal{X} = \{1, 2\}$, defined as follows. The
process does not move except at the times $T_n$, and at each time $T_n$, the
process jumps from its current state to the opposite state (i.e., from 1 to
2, or from 2 to 1). Compute the generator $Q$ for this process. [Hint: You
may use without proof the following facts, which are consequences of the
“memoryless property” of the exponential distribution: for any $0 < a < b$,
(i) $P(\exists n : a < T_n < b) = 1 - e^{-5(b-a)}$, which is $\approx 5(b - a)$ as $b \searrow a$; and
also (ii) $P(\exists n : a < T_n < T_{n+1} < b) = 1 - e^{-5(b-a)} (1 + 5(b - a))$, which is
$\approx \frac{25}{2} (b - a)^2$ as $b \searrow a$.]


Exercise 15.3.10. (Poisson process.) Let $\lambda > 0$, let $\{Z_n\}$ be i.i.d.
$\sim \mathrm{Exp}(\lambda)$, and let $T_n = Z_1 + Z_2 + \ldots + Z_n$ for $n \ge 1$. Let $\{N_t\}_{t \ge 0}$ be a
continuous-time Markov process on the state space $\mathcal{X} = \{0, 1, 2, \ldots\}$, with
$N_0 = 0$, which does not move except at the times $T_n$, and increases by 1
at each time $T_n$. (Equivalently, $N_t = \#\{n \in \mathbf{N} : T_n \le t\}$; intuitively, $N_t$
counts the number of events by time $t$.)

(a) Compute the generator $Q$ for this process. [Don’t forget the hint from
the previous exercise.]

(b) Prove that $P(N_t \le m) = e^{-\lambda t} (\lambda t)^m / m! + P(N_t \le m - 1)$ for $m =
0, 1, 2, \ldots$. [Hint: First argue that $P(N_t \le m) = P(T_{m+1} > t)$ and that
$T_m \sim \mathrm{Gamma}(m, \lambda)$, and then use integration by parts; you may assume
the result and hints of Exercise 9.5.17.]

(c) Conclude that $P(N_t = j) = e^{-\lambda t} (\lambda t)^j / j!$, i.e. that $N_t \sim \mathrm{Poisson}(\lambda t)$.
[Remark: More generally, there are Poisson processes with variable rate
functions $\lambda : [0, \infty) \to [0, \infty)$, for which $N_t \sim \mathrm{Poisson}(\int_0^t \lambda(s)\, ds)$.]
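
The construction of $N_t$ from exponential inter-arrival times is easy to simulate, which gives an informal check of the Poisson distribution claimed in part (c). The sketch below is only a simulation, not a proof; the seed, the values of `lam`, `t` and `runs`, and the helper names are our own illustrative choices.

```python
import math
import random


def simulate_count(lam, t, rng):
    """Return N_t = #{n : T_n <= t}, where T_n = Z_1 + ... + Z_n and Z_i ~ Exp(lam) i.i.d."""
    count, arrival = 0, rng.expovariate(lam)
    while arrival <= t:
        count += 1
        arrival += rng.expovariate(lam)
    return count


def poisson_pmf(j, mean):
    return math.exp(-mean) * mean ** j / math.factorial(j)


rng = random.Random(0)            # fixed seed, so that the run is reproducible
lam, t, runs = 2.0, 3.0, 20000    # illustrative values
counts = [simulate_count(lam, t, rng) for _ in range(runs)]

print(sum(counts) / runs, lam * t)                       # sample mean vs. lambda * t
for j in (3, 6, 9):
    print(j, counts.count(j) / runs, poisson_pmf(j, lam * t))
```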


Exercise 15.3.11. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate $\lambda > 0$.
(a) For $0 < s < t$, compute $P(N_s = 1 \mid N_t = 1)$.

(b) Use this to specify precisely the conditional distribution of the first
event time $T_1$, conditional on $N_t = 1$.

(c) More generally, for $r < t$ and $m \in \mathbf{N}$, specify the conditional
distribution of $T_m$, conditional on $N_r = m - 1$ and $N_t = m$.


Exercise 15.3.12. Let $\{N_t\}_{t \ge 0}$ be a Poisson process with rate
$\lambda > 0$, let $0 < s < t$, and let $U_1, U_2$ be i.i.d. $\sim \mathrm{Uniform}[0, t]$.

(a) Compute $P(N_s = 0 \mid N_t = 2)$.

(b) Compute $P(U_1 > s \text{ and } U_2 > s)$.

(c) Compare the answers to parts (a) and (b). What does this comparison
seem to imply?
15.4. Brownian motion as a limit. 


Having discussed continuous-time processes on countable state spaces, 
we now turn to continuous-time processes on continuous state spaces. Such 
processes can move in continuous curves and therefore give rise to interesting 
random functions. The most important such process is Brownian motion, 
also called the Wiener process. Brownian motion is best understood as a 
limit of discrete time processes, as follows. 

Let $Z_1, Z_2, \ldots$ be any sequence of i.i.d. bounded random variables having
mean 0 and variance 1, e.g. $Z_i = \pm 1$ with probability 1/2 each. For each $n \in
\mathbf{N}$, define a discrete-time random process $\{Y^{(n)}_{i/n}\}_{i=0}^{\infty}$ iteratively by $Y^{(n)}_0 = 0$,
and

$$Y^{(n)}_{(i+1)/n} = Y^{(n)}_{i/n} + \frac{1}{\sqrt{n}} Z_{i+1}, \qquad i = 0, 1, 2, \ldots. \qquad (15.4.1)$$

Thus, $Y^{(n)}_{i/n} = \frac{1}{\sqrt{n}} (Z_1 + Z_2 + \ldots + Z_i)$. In particular, $\{Y^{(n)}_{i/n}\}$ is like simple
symmetric random walk, except with time sped up by a factor of $n$, and
space shrunk by a factor of $\sqrt{n}$.




Figure 15.4.2. Constructing Brownian motion. 


We can then “fill in” the missing values by linear interpolation on each
interval $[\frac{i}{n}, \frac{i+1}{n}]$. In this way, we obtain a continuous-time process
$\{Y^{(n)}_t\}_{t \ge 0}$; see Figure 15.4.2. Thus, $\{Y^{(n)}_t\}_{t \ge 0}$ is a continuous-time process
which agrees with $Y^{(n)}_{i/n}$ whenever $t = \frac{i}{n}$. Furthermore, $Y^{(n)}_t$ is always within
$O(1/n)$ of $Y^{(n)}_{\lfloor tn \rfloor / n}$, where $\lfloor r \rfloor$ is the greatest integer not exceeding $r$.
Intuitively, then, $\{Y^{(n)}_t\}_{t \ge 0}$ is like an ordinary discrete-time random walk,
except that it has been interpolated to a continuous-time process, and
furthermore time has been sped up by a factor of $n$ and space has been
shrunk by a factor of $\sqrt{n}$. In summary, this process takes lots and lots of
very small steps.
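The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the steps are taken to be $Z_i = \pm 1$, and the names `scaled_walk` and `interpolate` are hypothetical helpers, not notation from the text.

```python
import math
import random

def scaled_walk(n, t_max, rng):
    """Grid values Y^{(n)}_{i/n}, i = 0..floor(t_max * n), for +/-1 steps (an assumed choice of Z_i)."""
    steps = int(math.floor(t_max * n))
    y = [0.0]                                # Y^{(n)}_0 = 0
    for _ in range(steps):
        z = rng.choice((-1.0, 1.0))          # Z_{i+1} = +/-1 with probability 1/2 each
        y.append(y[-1] + z / math.sqrt(n))   # recursion (15.4.1)
    return y

def interpolate(y, n, t):
    """Linear interpolation of the grid values at a time t in [0, (len(y) - 1) / n]."""
    i = min(int(math.floor(t * n)), len(y) - 2)
    frac = t * n - i
    return y[i] + frac * (y[i + 1] - y[i])

rng = random.Random(0)
path = scaled_walk(n=100, t_max=1.0, rng=rng)
```

With `n = 100`, each step has size $1/\sqrt{100} = 0.1$ and there are 100 steps per unit of time, which is the "lots of very small steps" picture.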


Remark. It is not essential that we choose the missing values so as to make
the function linear on $[\frac{i}{n}, \frac{i+1}{n}]$. The same intuitive limit will be obtained
no matter how the missing values are defined, provided only that they are
close (as $n \to \infty$) to the corresponding $Y^{(n)}_{i/n}$ values. For example, another
option is to use a constant interpolation of the missing values, by setting
$Y^{(n)}_t = Y^{(n)}_{\lfloor tn \rfloor / n}$. In fact, this constant interpolation would make the following
calculations cleaner by eliminating the $O(1/n)$ errors. However, the linear
interpolation has the advantage that each $Y^{(n)}_t$ is then a continuous function
of $t$, which makes it more intuitive why $B_t$ is also continuous.


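Now, the factors $n$ and $\sqrt{n}$ have been chosen carefully. We see that
$Y^{(n)}_t \approx Y^{(n)}_{\lfloor tn \rfloor / n} = \frac{1}{\sqrt{n}} (Z_1 + Z_2 + \ldots + Z_{\lfloor tn \rfloor})$. Thus, $Y^{(n)}_t$ is essentially $\frac{1}{\sqrt{n}}$
times the sum of $\lfloor tn \rfloor$ different i.i.d. random variables, each having mean 0
and variance 1. It follows from the ordinary Central Limit Theorem that,
as $n \to \infty$ with $t$ fixed, we have $\mathcal{L}(Y^{(n)}_t) \Rightarrow N(0, t)$, i.e. for large $n$ the
random variable $Y^{(n)}_t$ is approximately normal with mean 0 and variance $t$.

Let us note a couple of other facts. Firstly, for $s < t$,

$$E\bigl(Y^{(n)}_s Y^{(n)}_t\bigr) \approx E\bigl(Y^{(n)}_{\lfloor sn \rfloor / n} Y^{(n)}_{\lfloor tn \rfloor / n}\bigr) = \frac{1}{n} E\bigl((Z_1 + \ldots + Z_{\lfloor sn \rfloor})(Z_1 + \ldots + Z_{\lfloor tn \rfloor})\bigr).$$

Since the $Z_i$ have mean 0 and variance 1, and are independent, this is equal to
$\lfloor sn \rfloor / n$, which converges to $s$ as $n \to \infty$.

Secondly, if $n \to \infty$ with $s < t$ fixed, the difference $Y^{(n)}_t - Y^{(n)}_s$ is within
$O(1/n)$ of $\frac{1}{\sqrt{n}} (Z_{\lfloor sn \rfloor + 1} + \ldots + Z_{\lfloor tn \rfloor})$. Hence, by the Central Limit Theorem,
$Y^{(n)}_t - Y^{(n)}_s$ converges weakly to $N(0, t - s)$ as $n \to \infty$, and is in fact
independent of $Y^{(n)}_s$.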
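The covariance claim can be checked exactly for small $n$ by brute force. The sketch below assumes $\pm 1$ steps and works with the grid values $Y^{(n)}_{\lfloor sn \rfloor / n}$ and $Y^{(n)}_{\lfloor tn \rfloor / n}$; the function name `exact_covariance` is hypothetical.

```python
import itertools
import math

def exact_covariance(n, s, t):
    """E[Y_{floor(sn)/n} * Y_{floor(tn)/n}] for +/-1 steps, averaging over all sign sequences."""
    ks, kt = math.floor(s * n), math.floor(t * n)
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=kt):
        ys = sum(signs[:ks]) / math.sqrt(n)   # value at time floor(sn)/n
        yt = sum(signs) / math.sqrt(n)        # value at time floor(tn)/n
        total += ys * yt
    return total / 2 ** kt                    # each sign sequence has probability 2^{-kt}

cov = exact_covariance(n=4, s=0.5, t=1.5)     # should equal floor(sn)/n = 2/4
```

The enumeration grows like $2^{\lfloor tn \rfloor}$, so it is only practical for small $n$; it is meant as a sanity check of the identity $\lfloor sn \rfloor / n$, not as a simulation method.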

Brownian motion is, intuitively, the limit as $n \to \infty$ of the processes
$\{Y^{(n)}_t\}_{t \ge 0}$. That is, we can intuitively define Brownian motion $\{B_t\}_{t \ge 0}$ by
saying that $B_t = \lim_{n \to \infty} Y^{(n)}_t$ for each fixed $t \ge 0$. This analogy is very
useful, in that all the $n \to \infty$ properties of $Y^{(n)}_t$ discussed above will apply
to Brownian motion. In particular:

• normally distributed: $\mathcal{L}(B_t) = N(0, t)$ for any $t \ge 0$.

• covariance structure: $E(B_s B_t) = s$ for $0 \le s \le t$.

• independent normal increments: $\mathcal{L}(B_{t_2} - B_{t_1}) = N(0, t_2 - t_1)$, and $B_{t_4} - B_{t_3}$
is independent of $B_{t_2} - B_{t_1}$ whenever $0 \le t_1 \le t_2 \le t_3 \le t_4$.

In addition, Brownian motion has one other very important property. 
It has continuous sample paths, meaning that for each fixed $\omega \in \Omega$, the
function $B_t(\omega)$ is a continuous function of $t$. (However, it turns out that,
with probability one, Brownian motion is not differentiable anywhere at all!)


Exercise 15.4.3. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute
$E[(B_2 + B_3 + 1)^2]$.


Exercise 15.4.4. Let $\{B_t\}_{t \ge 0}$ be Brownian motion, and let $X_t = 2t + 3B_t$
for $t \ge 0$.

(a) Compute the distribution of $X_t$ for $t \ge 0$.

(b) Compute $E[(X_t)^2]$ for $t \ge 0$.

(c) Compute $E(X_s X_t)$ for $0 < s < t$.


Exercise 15.4.5. Let $\{B_t\}_{t \ge 0}$ be Brownian motion. Compute the distribution
of $Z_n$ for $n \in \mathbf{N}$, where $Z_n = \frac{1}{n}(B_1 + B_2 + \ldots + B_n)$.


Exercise 15.4.6. (a) Let $f : \mathbf{R} \to \mathbf{R}$ be a Lipschitz function, i.e. a
function for which there exists $\alpha \in \mathbf{R}$ such that $|f(x) - f(y)| \le \alpha |x - y|$ for
all $x, y \in \mathbf{R}$. Compute $\lim_{h \searrow 0} (f(t + h) - f(t))^2 / h$ for any $t \in \mathbf{R}$.

(b) Let $\{B_t\}$ be Brownian motion. Compute $\lim_{h \searrow 0} E\bigl((B_{t+h} - B_t)^2 / h\bigr)$
for any $t > 0$.

(c) What do parts (a) and (b) seem to imply about Brownian motion? 


15.5. Existence of Brownian motion. 


We thus see that Brownian motion has many useful properties. However,
we cannot make mathematical use of these properties until we have
established that Brownian motion exists, i.e. that there is a probability triple
$(\Omega, \mathcal{F}, P)$, with random variables $\{B_t\}_{t \ge 0}$ defined on it, which satisfy the
properties specified above.

Now, the Kolmogorov Existence Theorem (Theorem 15.1.3) is of some
help here. It ensures the existence of a stochastic process having the same
finite-dimensional distributions as our desired process $\{B_t\}$. At first glance
this might appear to be all we need. Indeed, the properties of being
normally distributed, of having the right covariance structure, and of having
independent increments, all follow immediately from the finite-dimensional
distributions.

The problem, however, is that finite-dimensional distributions cannot
guarantee the property of continuous sample paths. To see this, let $\{B_t\}_{t \ge 0}$
be Brownian motion, and let $U$ be a random variable which is distributed
according to (say) Lebesgue measure on $[0, 1]$. Define a new stochastic
process $\{B'_t\}_{t \ge 0}$ by $B'_t = B_t + \mathbf{1}_{\{U = t\}}$. That is, $B'_t$ is equal to $B_t$, except
that if we happen to have $U = t$ then we add one to $B_t$. Now, since
$P(U = t) = 0$ for any fixed $t$, we see that $\{B'_t\}_{t \ge 0}$ has exactly the same
finite-dimensional distributions as $\{B_t\}_{t \ge 0}$ does. On the other hand, obviously if
$\{B_t\}_{t \ge 0}$ has continuous sample paths, then $\{B'_t\}_{t \ge 0}$ cannot. This shows that
the Kolmogorov Existence Theorem is not sufficient to properly construct
Brownian motion. Instead, one of several alternative approaches must be
taken.

In the first approach, we begin by setting $B_0 = 0$ and choosing $B_1 \sim N(0, 1)$.
We then define $B_{1/2}$ to have its correct conditional distribution,
conditional on the values of $B_0$ and $B_1$. Continuing, we then define $B_{1/4}$
to have its correct conditional distribution, conditional on the values of $B_0$
and $B_{1/2}$, and similarly define $B_{3/4}$ conditional on the values of $B_{1/2}$ and $B_1$.
Iteratively, we see that we can define $B_t$ for all values of $t$ which are dyadic
rationals, i.e. which are of the form $\frac{i}{2^n}$ for some integers $i$ and $n$. We then
argue that the function $\{B_t;\ t \text{ a non-negative dyadic rational}\}$ is uniformly
continuous, and use this to argue that it can be “filled in” to a continuous
function $\{B_t;\ t \ge 0\}$ defined for all non-negative real values $t$. For the
details of this approach, see e.g. Billingsley (1995, Theorem 37.1).
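The dyadic refinement step can be sketched as follows, restricted to the interval $[0, 1]$. The sketch relies on a standard fact that the text does not state explicitly and is therefore an assumption here: given $B_a = x$ and $B_b = y$, the midpoint value $B_{(a+b)/2}$ is normal with mean $(x + y)/2$ and variance $(b - a)/4$. The function name `dyadic_brownian` is hypothetical.

```python
import math
import random

def dyadic_brownian(levels, rng):
    """Values of B at the dyadic points i / 2**levels of [0, 1], built by midpoint refinement."""
    values = [0.0, rng.gauss(0.0, 1.0)]        # B_0 = 0 and B_1 ~ N(0, 1)
    gap = 1.0                                  # spacing between the times defined so far
    for _ in range(levels):
        refined = []
        for left, right in zip(values, values[1:]):
            mean = 0.5 * (left + right)        # assumed conditional mean of the midpoint
            sd = math.sqrt(gap / 4.0)          # assumed conditional variance: gap / 4
            refined.extend([left, rng.gauss(mean, sd)])
        refined.append(values[-1])
        values, gap = refined, gap / 2.0
    return values

b = dyadic_brownian(levels=5, rng=random.Random(1))   # B at i/32 for i = 0..32
```

Each pass keeps every value already defined and only inserts midpoints, which mirrors the iterative definition in the text. The continuity argument (uniform continuity on the dyadic rationals) is the hard part and is not touched by this sketch.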


In another approach, we define the processes $\{Y^{(n)}_t\}$ as above, and
observe that $Y^{(n)}_t$ is a (random) continuous function of $t$. We then prove that
as $n \to \infty$, the distributions of the functions $\{Y^{(n)}_t\}_{t \ge 0}$ (regarded as
probability distributions on the space of all continuous functions) are tight, and
in fact converge to the distribution of $\{B_t\}_{t \ge 0}$ that we want. We then use
this fact to conclude the existence of the random variables $\{B_t\}_{t \ge 0}$. For
details of this approach, see e.g. Fristedt and Gray (1997, Section 19.2).


Exercise 15.5.1. Construct a process $\{B''_t\}_{t \ge 0}$ which has the same
finite-dimensional distributions as does $\{B_t\}_{t \ge 0}$ and $\{B'_t\}_{t \ge 0}$, but such that
neither $\{B''_t\}_{t \ge 0}$ nor $\{B''_t - B'_t\}_{t \ge 0}$ has continuous sample paths.


Exercise 15.5.2. Let $\{X_t\}_{t \in T}$ and $\{X'_t\}_{t \in T}$ be stochastic processes
with the countable time index $T$. Suppose $\{X_t\}_{t \in T}$ and $\{X'_t\}_{t \in T}$ have
identical finite-dimensional distributions. Prove or disprove that $\{X_t\}_{t \in T}$ and
$\{X'_t\}_{t \in T}$ must have the same full joint distribution.
15.6. Diffusions and stochastic integrals.


Suppose we are given continuous functions $\mu, \sigma : \mathbf{R} \to \mathbf{R}$, and we replace
equation (15.4.1) by

$$Y^{(n)}_{(i+1)/n} = Y^{(n)}_{i/n} + \frac{1}{\sqrt{n}} Z_{i+1}\, \sigma\bigl(Y^{(n)}_{i/n}\bigr) + \frac{1}{n}\, \mu\bigl(Y^{(n)}_{i/n}\bigr). \tag{15.6.1}$$

That is, we multiply the effect of $Z_{i+1}$ by $\sigma(Y^{(n)}_{i/n})$, and we add in an extra
drift of $\frac{1}{n} \mu(Y^{(n)}_{i/n})$. If we then interpolate this function to a function $\{Y^{(n)}_t\}$
defined for all $t \ge 0$, and let $n \to \infty$, we obtain a diffusion, with drift $\mu(x)$,
and diffusion coefficient or volatility $\sigma(x)$. Intuitively, the diffusion is like
Brownian motion, except that its mean is drifting according to the function
$\mu(x)$, and its local variability is scaled according to the function $\sigma(x)$. In
this context, Brownian motion is the special case $\mu(x) \equiv 0$ and $\sigma(x) \equiv 1$.

Now, for large $n$ we see from (15.4.1) that $\frac{1}{\sqrt{n}} Z_{i+1} \approx B_{(i+1)/n} - B_{i/n}$, so
that (15.6.1) is essentially saying that

$$Y^{(n)}_{(i+1)/n} - Y^{(n)}_{i/n} \approx \bigl(B_{(i+1)/n} - B_{i/n}\bigr)\, \sigma\bigl(Y^{(n)}_{i/n}\bigr) + \frac{1}{n}\, \mu\bigl(Y^{(n)}_{i/n}\bigr).$$

If we now (intuitively) set $X_t = \lim_{n \to \infty} Y^{(n)}_t$ for each fixed $t \ge 0$, then
we can write the limiting version of (15.6.1) as

$$dX_t = \sigma(X_t)\, dB_t + \mu(X_t)\, dt, \tag{15.6.2}$$

where $\{B_t\}_{t \ge 0}$ is Brownian motion. The process $\{X_t\}_{t \ge 0}$ is called a
diffusion with drift $\mu(x)$ and diffusion coefficient or volatility $\sigma(x)$. Its
definition (15.6.2) can be interpreted roughly as saying that, as $h \searrow 0$, we
have

$$X_{t+h} \approx X_t + \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h; \tag{15.6.3}$$

thus, given that $X_t = x$, we have that approximately $X_{t+h} \sim N(x + \mu(x) h,\ \sigma^2(x) h)$.
If $\mu(x) \equiv 0$ and $\sigma(x) \equiv 1$, then the diffusion coincides
exactly with Brownian motion.
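Iterating the approximation (15.6.3) with a small fixed $h$ gives a simple way to simulate an approximate diffusion path. The sketch below is an illustration only; the function name `simulate_diffusion` and its parameters are hypothetical, and no claim is made about how quickly the approximation converges.

```python
import math
import random

def simulate_diffusion(mu, sigma, x0, h, steps, rng):
    """Iterate X_{t+h} = X_t + sigma(X_t) * (B_{t+h} - B_t) + mu(X_t) * h, as in (15.6.3)."""
    x = [x0]
    for _ in range(steps):
        db = rng.gauss(0.0, math.sqrt(h))    # Brownian increment, distributed N(0, h)
        x.append(x[-1] + sigma(x[-1]) * db + mu(x[-1]) * h)
    return x

# Special case mu = 0, sigma = 1: the recursion just accumulates Brownian increments.
bm_path = simulate_diffusion(lambda x: 0.0, lambda x: 1.0, 0.0, 0.01, 100, random.Random(0))
```

Passing `mu` and `sigma` as functions keeps the sketch close to the notation $\mu(x)$, $\sigma(x)$ of the text; constant drift and volatility are just constant functions.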

We can also compute a generator for a diffusion defined by (15.6.2). To
do this we must generalise the matrix generator (15.3.4) for processes on
finite state spaces. We define the generator to be an operator $Q$ acting on
smooth functions $f : \mathbf{R} \to \mathbf{R}$ by

$$(Qf)(x) = \lim_{h \searrow 0} \frac{1}{h}\, E_x\bigl(f(X_{t+h}) - f(X_t)\bigr), \tag{15.6.4}$$

where $E_x$ means expectation conditional on $X_t = x$. That is, $Q$ maps one
smooth function $f$ to another function, $Qf$.
Now, for a diffusion satisfying (15.6.3), we can use a Taylor series
expansion to approximate $f(X_{t+h})$ as

$$f(X_{t+h}) \approx f(X_t) + (X_{t+h} - X_t)\, f'(X_t) + \frac{1}{2} (X_{t+h} - X_t)^2 f''(X_t). \tag{15.6.5}$$

On the other hand,

$$X_{t+h} - X_t \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h.$$

Conditional on $X_t = x$, the conditional distribution of $B_{t+h} - B_t$ is normal
with mean 0 and variance $h$; hence, the conditional distribution of $X_{t+h} - X_t$
is approximately normal with mean $\mu(x) h$ and variance $\sigma^2(x) h$. So, taking
expected values conditional on $X_t = x$, we conclude from (15.6.5) that

$$E_x\bigl[f(X_{t+h}) - f(X_t)\bigr] \approx \mu(x)\, h\, f'(x) + \frac{1}{2} \bigl[(\mu(x) h)^2 + \sigma^2(x) h\bigr] f''(x)$$

$$= \mu(x)\, h\, f'(x) + \frac{1}{2} \sigma^2(x)\, h\, f''(x) + O(h^2).$$

It then follows from the definition (15.6.4) of generator that

$$(Qf)(x) = \mu(x)\, f'(x) + \frac{1}{2} \sigma^2(x)\, f''(x). \tag{15.6.6}$$


We have thus given an explicit formula for the generator Q of a diffusion 
defined by (15.6.2). Indeed, as with Markov processes on discrete spaces, 
diffusions are often characterised in terms of their generators, which again 
provide all the information necessary to completely reconstruct the process’s 
transition probabilities. 
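Formula (15.6.6) is easy to evaluate numerically. The sketch below approximates $f'$ and $f''$ by central differences (a numerical shortcut chosen for this illustration, not something the text prescribes); the name `generator` and the step size `eps` are hypothetical.

```python
def generator(mu, sigma, f, x, eps=1e-4):
    """Approximate (Qf)(x) = mu(x) f'(x) + 0.5 sigma(x)^2 f''(x) using central differences."""
    d1 = (f(x + eps) - f(x - eps)) / (2 * eps)              # approximates f'(x)
    d2 = (f(x + eps) - 2 * f(x) + f(x - eps)) / eps ** 2    # approximates f''(x)
    return mu(x) * d1 + 0.5 * sigma(x) ** 2 * d2

# Brownian motion (mu = 0, sigma = 1) applied to f(x) = x^2: (Qf)(x) = 0.5 * 2 = 1.
q_bm = generator(lambda x: 0.0, lambda x: 1.0, lambda x: x * x, 0.7)

# Constant drift a = 3 and volatility b = 2 with f(x) = x^2: (Qf)(x) = 2ax + b^2.
q_const = generator(lambda x: 3.0, lambda x: 2.0, lambda x: x * x, 0.5)
```

For Brownian motion the generator reduces to $\frac{1}{2} f''$, which is why `q_bm` does not depend on the evaluation point for a quadratic $f$.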

Finally, we note that Brownian motion can be used to define a certain
unusual kind of integral, called a stochastic integral or Itô integral. Let
$\{B_t\}_{t \ge 0}$ be Brownian motion, and let $\{C_t\}_{t \ge 0}$ be a non-anticipative
real-valued random variable (i.e., the value of $C_t$ is conditionally independent
of $\{B_s\}_{s > t}$, given $\{B_s\}_{s \le t}$). Then it is possible to define an integral of the
form $\int_a^b C_t\, dB_t$, i.e. an integral “with respect to Brownian motion”. Roughly
speaking, this integral is defined by

$$\int_a^b C_t\, dB_t = \lim \sum_{i=0}^{m-1} C_{t_i} \bigl(B_{t_{i+1}} - B_{t_i}\bigr),$$

where $a = t_0 < t_1 < \ldots < t_m = b$, and where the limit is over finer
and finer partitions $\{(t_{i-1}, t_i]\}$ of $(a, b]$. Note that this integral is random
both because of the randomness of the values of the integrand $C_t$, and also
because of the randomness from Brownian motion $B_t$ itself. In particular,
we now see that (15.6.2) can be written equivalently as

$$X_b - X_a = \int_a^b \sigma(X_t)\, dB_t + \int_a^b \mu(X_t)\, dt,$$

where the first integral is with respect to Brownian motion. Such stochastic
integrals are used in a variety of research areas; for further details see e.g.
Bhattacharya and Waymire (1990, Chapter VII), or the other references in
Subsection B.5.
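The left-endpoint sums in the definition can be computed directly on a simulated path. The sketch below takes the integrand $C_t = B_t$ on $[0, 1]$ with an equally spaced partition (both are choices made for this illustration) and uses the hypothetical helper `ito_sum`. For any partition, algebra alone gives $\sum_i B_{t_i}(B_{t_{i+1}} - B_{t_i}) = \frac{1}{2}\bigl(B_b^2 - B_a^2 - \sum_i (B_{t_{i+1}} - B_{t_i})^2\bigr)$, so the sum differs from the classical answer $\frac{1}{2}(B_b^2 - B_a^2)$ by half the sum of squared increments.

```python
import math
import random

def ito_sum(c, b):
    """Left-endpoint sum  sum_i c[i] * (b[i+1] - b[i])  over the partition underlying b."""
    return sum(c[i] * (b[i + 1] - b[i]) for i in range(len(b) - 1))

rng = random.Random(0)
m, T = 1000, 1.0
b = [0.0]
for _ in range(m):
    b.append(b[-1] + rng.gauss(0.0, math.sqrt(T / m)))     # simulated Brownian path on [0, T]

approx = ito_sum(b, b)                                      # integrand C_t = B_t
quad_var = sum((b[i + 1] - b[i]) ** 2 for i in range(m))    # sum of squared increments
```

The squared increments each have mean $T/m$, so `quad_var` has mean $T$; this is the same phenomenon, $(B_{t+h} - B_t)^2 \approx h$, that separates stochastic from classical integration.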


Exercise 15.6.7. Let $\{X_t\}_{t \ge 0}$ be a diffusion with constant drift $\mu(x) = a$
and constant volatility $\sigma(x) = b > 0$.

(a) Show that $X_t = at + bB_t$.

(b) Show that $\mathcal{L}(X_t) = N(at, b^2 t)$.

(c) Compute $E(X_s X_t)$ for $0 < s < t$.


Exercise 15.6.8. Let $\{B_t\}_{t \ge 0}$ be standard Brownian motion, with
$B_0 = 0$. Let $X_t = \int_0^t a\, ds + \int_0^t b\, dB_s = at + bB_t$ be a diffusion with
constant drift $\mu(x) = a > 0$ and constant volatility $\sigma(x) = b > 0$. Let
$Z_t = \exp\bigl[-(a + \frac{1}{2} b^2) t + X_t\bigr]$.

(a) Prove that $\{Z_t\}_{t \ge 0}$ is a martingale, i.e. that $E[Z_t \mid Z_u\ (0 \le u \le s)] = Z_s$
for $0 < s < t$.

(b) Let $A, B > 0$, and let $T_A = \inf\{t \ge 0 : X_t = A\}$ and $T_{-B} = \inf\{t \ge 0 : X_t = -B\}$
denote the first hitting times of $A$ and $-B$, respectively.
Compute $P(T_A < T_{-B})$. [Hint: Use part (a); you may assume that
Corollary 14.1.7 also applies in continuous time.]


Exercise 15.6.9. Let $g : \mathbf{R} \to \mathbf{R}$ be a smooth function with $g(x) > 0$
and $\int g(x)\, dx = 1$. Consider the diffusion $\{X_t\}$ defined by (15.6.2) with
$\sigma(x) \equiv 1$ and $\mu(x) = \frac{1}{2} g'(x) / g(x)$. Show that the probability distribution
having density $g$ (with respect to Lebesgue measure on $\mathbf{R}$) is a stationary
distribution for the diffusion. [Hint: Show that the diffusion is reversible
with respect to $g(x)\, dx$. You may use (15.6.3) and Exercise 15.3.7.] This
diffusion is called a Langevin diffusion.


Exercise 15.6.10. Let $\{p_{ij}\}$ be the transition probabilities for a Markov
chain on a finite state space $\mathcal{X}$. Define the matrix $Q = (q_{ij})$ by (15.3.4).
Let $f : \mathcal{X} \to \mathbf{R}$ be any function, and let $i \in \mathcal{X}$. Prove that $(Qf)_i$ (i.e.,
$\sum_{k \in \mathcal{X}} q_{ik} f_k$) corresponds to $(Qf)(i)$ from (15.6.4).


Exercise 15.6.11. Suppose a diffusion $\{X_t\}_{t \ge 0}$ satisfies that

$$(Qf)(x) = 17 f'(x) + 23 x^2 f''(x),$$

for all smooth $f : \mathbf{R} \to \mathbf{R}$. Compute the drift and volatility functions for
this diffusion.


Exercise 15.6.12. Suppose a diffusion $\{X_t\}_{t \ge 0}$ has generator given
by $(Qf)(x) = -\frac{x}{2} f'(x) + \frac{1}{2} f''(x)$. (Such a diffusion is called an
Ornstein-Uhlenbeck process.)

(a) Write down a formula for $dX_t$. [Hint: Use (15.6.6).]

(b) Show that $\{X_t\}_{t \ge 0}$ is reversible with respect to the standard normal
distribution, $N(0, 1)$. [Hint: Use Exercise 15.6.9.]


15.7. Itô’s Lemma.


Let $\{X_t\}_{t \ge 0}$ be a diffusion satisfying (15.6.2), and let $f : \mathbf{R} \to \mathbf{R}$ be a
smooth function. Then we might expect that $\{f(X_t)\}_{t \ge 0}$ would itself be a
diffusion, and we might wonder what drift and volatility would correspond
to $\{f(X_t)\}_{t \ge 0}$. Specifically, we would like to compute an expression for
$d(f(X_t))$.

To that end, we consider small $h \approx 0$, and use the Taylor approximation

$$f(X_{t+h}) \approx f(X_t) + f'(X_t)(X_{t+h} - X_t) + \frac{1}{2} f''(X_t)(X_{t+h} - X_t)^2 + o(h).$$

Now, in the classical world, an expression like $(X_{t+h} - X_t)^2$ would be $O(h^2)$
and hence could be neglected in this computation. However, we shall see
that for Itô integrals this is not the case.

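We continue by writing

$$X_{t+h} - X_t = \int_t^{t+h} \sigma(X_s)\, dB_s + \int_t^{t+h} \mu(X_s)\, ds \approx \sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h.$$

Hence,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t) \bigl[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\bigr] + \frac{1}{2} f''(X_t) \bigl[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\bigr]^2 + o(h).$$

In the expansion of $\bigl(\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t) h\bigr)^2$, all terms are $o(h)$ aside
from the first, so that

$$f(X_{t+h}) \approx f(X_t) + f'(X_t) \bigl[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\bigr] + \frac{1}{2} f''(X_t) \bigl[\sigma(X_t)(B_{t+h} - B_t)\bigr]^2 + o(h).$$

But $B_{t+h} - B_t \sim N(0, h)$, so that $E\bigl[(B_{t+h} - B_t)^2\bigr] = h$, and in fact
$(B_{t+h} - B_t)^2 = h + o(h)$. (This is the departure from classical integration theory;
there, we would have $(B_{t+h} - B_t)^2 = o(h)$.) Hence, as $h \searrow 0$,

$$f(X_{t+h}) \approx f(X_t) + f'(X_t) \bigl[\sigma(X_t)(B_{t+h} - B_t) + \mu(X_t)\, h\bigr] + \frac{1}{2} f''(X_t)\, \sigma^2(X_t)\, h + o(h).$$

We can write this in differential form as

$$d(f(X_t)) = f'(X_t) \bigl[\sigma(X_t)\, dB_t + \mu(X_t)\, dt\bigr] + \frac{1}{2} f''(X_t)\, \sigma^2(X_t)\, dt,$$

or

$$d(f(X_t)) = f'(X_t)\, \sigma(X_t)\, dB_t + \Bigl(f'(X_t)\, \mu(X_t) + \frac{1}{2} f''(X_t)\, \sigma^2(X_t)\Bigr) dt. \tag{15.7.1}$$

That is, $\{f(X_t)\}$ is itself a diffusion, with new drift $\tilde{\mu}(x) = f'(x)\, \mu(x) + \frac{1}{2} f''(x)\, \sigma^2(x)$
and new volatility $\tilde{\sigma}(x) = f'(x)\, \sigma(x)$.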

Equation (15.7.1) is Itô’s Lemma, or Itô’s formula. It is a generalisation
of the classical chain rule, and is very important for doing computations with
Itô integrals. By (15.6.2), it implies that $d(f(X_t))$ is equal to $f'(X_t)\, dX_t + \frac{1}{2} f''(X_t)\, \sigma^2(X_t)\, dt$.
The extra term $\frac{1}{2} f''(X_t)\, \sigma^2(X_t)\, dt$ arises because of the
unusual nature of Brownian motion, namely that $(B_{t+h} - B_t)^2 \approx h$ instead
of being $O(h^2)$. (In particular, this again indicates why Brownian motion
is not differentiable; compare Exercise 15.4.6.)
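As a short worked instance of (15.7.1), take the simplest case: $X_t = B_t$ (so $\mu(x) \equiv 0$ and $\sigma(x) \equiv 1$) and $f(x) = x^2$. This particular choice of $f$ is made here purely for illustration.

```latex
% Ito's Lemma (15.7.1) with X_t = B_t, mu = 0, sigma = 1, f(x) = x^2:
% f'(x) = 2x and f''(x) = 2.
\begin{align*}
d\bigl(B_t^2\bigr)
  &= f'(B_t)\,\sigma(B_t)\,dB_t
     + \Bigl( f'(B_t)\,\mu(B_t) + \tfrac{1}{2} f''(B_t)\,\sigma^2(B_t) \Bigr)\,dt \\
  &= 2 B_t\,dB_t + \bigl( 0 + \tfrac{1}{2}\cdot 2 \cdot 1 \bigr)\,dt \\
  &= 2 B_t\,dB_t + dt .
\end{align*}
% The classical chain rule would give only 2 B_t dB_t; the extra dt is the Ito correction.
```

In integrated form with $B_0 = 0$, this reads $B_T^2 = 2 \int_0^T B_t\, dB_t + T$, so the extra $dt$ term accounts for a total contribution of exactly $T$ over $[0, T]$.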


Exercise 15.7.2. Consider the Ornstein-Uhlenbeck process $\{X_t\}$ of
Exercise 15.6.12, with generator $(Qf)(x) = -\frac{x}{2} f'(x) + \frac{1}{2} f''(x)$.

(a) Let $Y_t = (X_t)^2$ for each $t \ge 0$. Compute $dY_t$.

(b) Let $Z_t = (X_t)^3$ for each $t \ge 0$. Compute $dZ_t$.


Exercise 15.7.3. Consider the diffusion $\{X_t\}$ of Exercise 15.6.11, with
generator $(Qf)(x) = 17 f'(x) + 23 x^2 f''(x)$. Let $W_t = (X_t)^5$ for each
$t \ge 0$. Compute $dW_t$.


15.8. The Black-Scholes equation. 


In mathematical finance, the prices of stocks are often modeled as diffu- 
sions. Various calculations, such as the value of complicated stock options 
or derivatives, then proceed on that basis. 

In this final subsection, we give a very brief introduction to financial 
models, and indicate the derivation of the famous Black-Scholes equation 
for pricing a certain kind of stock option. Our treatment is very cursory;
for more complete discussion see books on mathematical finance, such as 
those listed in Subsection B.6. 

For simplicity, we suppose there is a constant “risk-free” interest rate $r$.
This means that an amount of money $M$ at present is worth $e^{rt} M$ at a time
$t$ later. Equivalently, an amount of money $M$ a time $t$ later is worth only
$e^{-rt} M$ at present.


We further suppose that there is a “stock” available for purchase. This
stock has a purchase price $P_t$ at time $t$. It is assumed that this stock price
can be modeled as a diffusion process, according to the equation

$$dP_t = b P_t\, dt + \sigma P_t\, dB_t, \tag{15.8.1}$$

i.e. with $\mu(x) = bx$ and $\sigma(x) = \sigma x$. Here $B_t$ is Brownian motion, $b$ is the
appreciation rate, and $\sigma$ is the volatility of the stock. For present purposes
we assume that $b$ and $\sigma$ are constant, though more generally they (and also
$r$) could be functions of $t$.


We wish to find a value, or fair price, for a “European call option” (an
example of a financial derivative). A European call option is the option to
purchase the stock at a fixed time $T > 0$ for a fixed price $q > 0$. Obviously,
one would exercise this option if and only if $P_T > q$, and in this case
one would gain an amount $P_T - q$. Therefore, the value at time $T$ of the
European call option will be precisely $\max(0, P_T - q)$; it follows that the
value at time 0 is $e^{-rT} \max(0, P_T - q)$. The problem is that, at time 0, the
future price $P_T$ is unknown (i.e. random). Hence, at time 0, the true value
of the option is also unknown. What we wish to compute is a fair price for
this option at time 0, given only the current price $P_0$ of the stock.


To specify what a fair price means is somewhat subtle. This price is 
taken to be the (unique) value such that neither the buyer nor the seller 
could strictly improve their payoff by investing in the stock directly (as op- 
posed to investing in the option), no matter how sophisticated an investment 
strategy they use (including selling short, i.e. holding a negative number of 
shares) and no matter how often they buy and sell the stock (without being 
charged any transaction commissions). 


It turns out that for our model, the fair price of the option is equal to
the expected value of the option given $P_0$, but only after replacing $b$ by $r$,
i.e. after replacing the appreciation rate by the risk-free interest rate. We
do not justify this fact here; for details see e.g. Theorem 1.2.1 of Karatzas,
1997. The argument involves switching to a different probability measure
(the risk-neutral equivalent martingale measure), under which $\{B_t\}$ is still
a (different) Brownian motion, but where (15.8.1) is replaced by
$dP_t = r P_t\, dt + \sigma P_t\, dB_t$.
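We are thus interested in computing

$$E\bigl(e^{-rT} \max(0, P_T - q) \mid P_0\bigr), \tag{15.8.2}$$

under the condition that

$$b \text{ is replaced by } r. \tag{15.8.3}$$

To proceed, we use Itô’s Lemma to compute $d(\log P_t)$. This means that
we let $f(x) = \log x$ in (15.7.1), so that $f'(x) = 1/x$ and $f''(x) = -1/x^2$.
We apply this to the diffusion (15.8.1), under the condition (15.8.3), so that
$\mu(x) = rx$ and $\sigma(x) = \sigma x$. We compute that

$$d(\log P_t) = \frac{1}{P_t}\, \sigma P_t\, dB_t + \Bigl(\frac{1}{P_t}\, r P_t + \frac{1}{2} \Bigl(-\frac{1}{P_t^2}\Bigr) (\sigma P_t)^2\Bigr) dt$$

$$= \sigma\, dB_t + \Bigl(r - \frac{1}{2} \sigma^2\Bigr) dt.$$

Since these coefficients are all constants, it now follows that

$$\log(P_T / P_0) = \log P_T - \log P_0 = \sigma (B_T - B_0) + \Bigl(r - \frac{1}{2} \sigma^2\Bigr)(T - 0),$$

whence

$$P_T = P_0 \exp\Bigl(\sigma (B_T - B_0) + \Bigl(r - \frac{1}{2} \sigma^2\Bigr) T\Bigr).$$

Recall now that $B_T - B_0$ has the normal distribution $N(0, T)$, with
density function $\frac{1}{\sqrt{2\pi T}} e^{-x^2 / 2T}$. Hence, by Proposition 6.2.3, we can compute
the value (15.8.2) of the European call option as

$$\int_{-\infty}^{\infty} e^{-rT} \max\Bigl(0,\ P_0 \exp\Bigl(\sigma x + \Bigl(r - \frac{1}{2} \sigma^2\Bigr) T\Bigr) - q\Bigr) \frac{1}{\sqrt{2\pi T}}\, e^{-x^2 / 2T}\, dx. \tag{15.8.4}$$

After some re-arranging and substituting, we simplify this integral to

$$P_0\, \Phi\Bigl(\frac{1}{\sigma \sqrt{T}} \Bigl(\log(P_0 / q) + T \bigl(r + \tfrac{1}{2} \sigma^2\bigr)\Bigr)\Bigr) - q\, e^{-rT}\, \Phi\Bigl(\frac{1}{\sigma \sqrt{T}} \Bigl(\log(P_0 / q) + T \bigl(r - \tfrac{1}{2} \sigma^2\bigr)\Bigr)\Bigr), \tag{15.8.5}$$

where $\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-s^2 / 2}\, ds$ is the usual cumulative distribution
function of the standard normal.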
Equation (15.8.5) thus gives the time-0 value, or fair price, of a European
call option to buy at time $T > 0$ for price $q > 0$ a stock having initial price
$P_0$ and price process $\{P_t\}_{t \ge 0}$ governed by (15.8.1). Equation (15.8.5) is the
well-known Black-Scholes equation for computing the price of a European
call option. Recall that here $r$ is the risk-free interest rate, and $\sigma$ is the
volatility. Furthermore, the appreciation rate $b$ does not appear in the final
formula, which is advantageous because $b$ is difficult to estimate in practice.
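The price formula (15.8.5) can be evaluated with the standard library alone, and compared against a direct numerical approximation of the integral (15.8.4). The sketch below is illustrative: the function names, the midpoint rule, the truncation of the integral to ten standard deviations, and the sample parameter values are all assumptions made here, not part of the text.

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal cumulative distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def call_price(p0, q, r, sigma, T):
    """Formula (15.8.5): time-0 fair price of a European call option."""
    scale = sigma * math.sqrt(T)
    d_plus = (math.log(p0 / q) + T * (r + 0.5 * sigma ** 2)) / scale
    d_minus = (math.log(p0 / q) + T * (r - 0.5 * sigma ** 2)) / scale
    return p0 * std_normal_cdf(d_plus) - q * math.exp(-r * T) * std_normal_cdf(d_minus)

def call_price_by_integration(p0, q, r, sigma, T, half_width=10.0, n=200000):
    """Midpoint-rule approximation of the integral (15.8.4), truncated to +/- half_width * sqrt(T)."""
    sd = math.sqrt(T)
    lo, hi = -half_width * sd, half_width * sd
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p_T = p0 * math.exp(sigma * x + (r - 0.5 * sigma ** 2) * T)
        density = math.exp(-x * x / (2.0 * T)) / math.sqrt(2.0 * math.pi * T)
        total += math.exp(-r * T) * max(0.0, p_T - q) * density * dx
    return total

# Sample (hypothetical) inputs: P_0 = q = 100, r = 5%, sigma = 20%, T = 1.
price = call_price(100.0, 100.0, 0.05, 0.2, 1.0)
check = call_price_by_integration(100.0, 100.0, 0.05, 0.2, 1.0)
```

The two values should agree to several decimal places, which is a numerical version of the claim that (15.8.4) simplifies to (15.8.5). Note that `call_price` takes no argument for the appreciation rate $b$, in line with the remark above.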

For further details of this subject, including generalisations to more com- 
plicated financial models, see any introductory book on the subject of math- 
ematical finance, such as those listed in Subsection B.6. 


Exercise 15.8.6. Show that (15.8.4) is indeed equal to (15.8.5). 


Exercise 15.8.7. Consider the price formula (15.8.5), with $r$, $\sigma$, $T$, and
$P_0$ fixed positive quantities.

(a) What happens to the price (15.8.5) as $q \searrow 0$? Does this result make
intuitive sense?

(b) What happens to the price (15.8.5) as $q \to \infty$? Does this result make
intuitive sense?


Exercise 15.8.8. Consider the price formula (15.8.5), with $r$, $\sigma$, $P_0$, and
$q$ fixed positive quantities.

(a) What happens to the price (15.8.5) as $T \searrow 0$? [Hint: Consider
separately the cases $q > P_0$, $q = P_0$, and $q < P_0$.] Does this result make intuitive
sense?

(b) What happens to the price (15.8.5) as $T \to \infty$? Does this result make
intuitive sense?


15.9. Section summary. 


This section considered various aspects of the theory of stochastic pro- 
cesses, in a somewhat informal manner. All of the topics considered can be 
thought of as generalisations of the discrete-time, discrete-space processes 
of Sections 7 and 8. 

First, the section considered existence questions, and stated (but did not 
prove) the important Kolmogorov Existence Theorem. It then discussed 
discrete-time processes on general (non-discrete) state spaces. It provided 
generalised notions of irreducibility and aperiodicity, and stated a theorem 
about convergence of irreducible, aperiodic Markov chains to their station- 
ary distributions. Next, continuous-time processes were considered. The 
semigroup property of Markov transition probabilities was presented. Gen- 
erators of continuous-time, discrete-space Markov processes were discussed. 

The section then focused on processes in continuous time and space. 
Brownian motion was developed as a limit of random walks that take smaller 
and smaller steps, more and more frequently; the normal, independent
increments of Brownian motion then followed naturally. Diffusions were then
developed as generalisations of Brownian motion. Generators of diffusions 
were described, as were stochastic integrals. 

The section closed with intuitive derivations of Itô’s Lemma, and of
the Black-Scholes equation from mathematical finance. The reader was 
encouraged to consult the books of Subsections B.5 and B.6 for further 
details. 
A. Mathematical Background.


This section reviews some of the mathematics that is necessary as a 
prerequisite to understanding this text. The material is reviewed in a form 
which is most helpful for our purposes. Note, however, that this material 
is merely an outline of what is needed; if this section is not largely review 
for you, then perhaps your background is insufficient to follow this text 
in detail, and you may wish to first consult references such as those in 
Subsection B.1. 


A.1. Sets and functions. 


A fundamental object in mathematics is a set, which is any¹ collection
of objects. For example, $\{1, 3, 4, 7\}$, the collection of rational numbers, the
collection of real numbers, and the empty set $\emptyset$ containing no elements, are
all sets. Given a set, a subset of that set is any subcollection. For example,
$\{3, 7\}$ is a subset of $\{1, 3, 4, 7\}$; in symbols, $\{3, 7\} \subseteq \{1, 3, 4, 7\}$.

Given a subset, its complement is the set consisting of all elements of a
“universal” set which are not in the subset. In symbols, if $A \subseteq \Omega$, where $\Omega$
is understood to be the universal set, then $A^C = \{\omega \in \Omega;\ \omega \notin A\}$.

Given a collection of sets $\{A_\alpha\}_{\alpha \in I}$ (where $I$ is any index set; perhaps
$I = \{1, 2\}$), we define their union to be the set of all elements which are in
at least one of the $A_\alpha$; in symbols, $\bigcup_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for some } \alpha \in I\}$.
Similarly, we define their intersection to be the set of all elements which are
in all of the $A_\alpha$; in symbols, $\bigcap_\alpha A_\alpha = \{\omega;\ \omega \in A_\alpha \text{ for all } \alpha \in I\}$.

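Union, intersection, and complement satisfy de Morgan’s Laws, which
state that $\bigl(\bigcup_\alpha A_\alpha\bigr)^C = \bigcap_\alpha A_\alpha^C$ and $\bigl(\bigcap_\alpha A_\alpha\bigr)^C = \bigcup_\alpha A_\alpha^C$. In words,
complement converts unions to intersections and vice-versa.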

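A collection $\{A_\alpha\}_{\alpha \in I}$ of subsets of a set $\Omega$ are disjoint if the intersection
of any pair of them is the empty set; in symbols, if $A_{\alpha_1} \cap A_{\alpha_2} = \emptyset$ whenever
$\alpha_1, \alpha_2 \in I$ are distinct. The collection $\{A_\alpha\}_{\alpha \in I}$ is called a partition of $\Omega$
if $\{A_\alpha\}_{\alpha \in I}$ is disjoint and also $\bigcup_\alpha A_\alpha = \Omega$. If $\{A_\alpha\}_{\alpha \in I}$ are disjoint, we
sometimes write their union as $\dot{\bigcup}_{\alpha \in I} A_\alpha$ (e.g. $A_{\alpha_1} \mathbin{\dot{\cup}} A_{\alpha_2}$) and call it a disjoint
union.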


We also define the difference of sets $A$ and $B$ by $A \setminus B = A \cap B^C$; for
example, $\{1, 3, 4, 7\} \setminus \{3, 7, 9\} = \{1, 4\}$.

Given sets $\Omega_1$ and $\Omega_2$, we define their (Cartesian) product to be the set
of all ordered pairs having first element from $\Omega_1$ and second element from
$\Omega_2$; in symbols, $\Omega_1 \times \Omega_2 = \{(\omega_1, \omega_2);\ \omega_1 \in \Omega_1,\ \omega_2 \in \Omega_2\}$. Similarly, larger
Cartesian products $\prod_\alpha \Omega_\alpha$ are defined.

¹In fact, certain very technical restrictions are required to avoid contradictions such
as Russell’s paradox. But we do not consider that here.

A function $f$ is a mapping² from a set $\Omega_1$ to a second set $\Omega_2$; in symbols,
$f : \Omega_1 \to \Omega_2$. Its domain is the set of points in $\Omega_1$ for which it is defined,
and its range is the set of points in $\Omega_2$ which are of the form $f(\omega_1)$ for some
$\omega_1 \in \Omega_1$.

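If $f : \Omega_1 \to \Omega_2$, and if $S_1 \subseteq \Omega_1$ and $S_2 \subseteq \Omega_2$, then the image of $S_1$ under
$f$ is given by $f(S_1) = \{\omega_2 \in \Omega_2;\ \omega_2 = f(\omega_1) \text{ for some } \omega_1 \in S_1\}$, and the
inverse image of $S_2$ under $f$ is given by $f^{-1}(S_2) = \{\omega_1 \in \Omega_1;\ f(\omega_1) \in S_2\}$.
We note that inverse images preserve the usual set operations, for example
$f^{-1}(S_1 \cup S_2) = f^{-1}(S_1) \cup f^{-1}(S_2)$; $f^{-1}(S_1 \cap S_2) = f^{-1}(S_1) \cap f^{-1}(S_2)$; and
$f^{-1}(S^C) = f^{-1}(S)^C$.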
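These identities are easy to check on a small finite example. In the sketch below, the sets, the function $f(\omega) = \omega^2$, and the helper name `inverse_image` are all hypothetical choices made for illustration. The last two lines add a standard contrast that the text does not state: forward images need not preserve intersections.

```python
def inverse_image(f, domain, target):
    """Return the inverse image {w in domain : f(w) in target}."""
    return {w for w in domain if f(w) in target}

domain = set(range(-3, 4))        # Omega_1 = {-3, ..., 3}
codomain = set(range(10))         # Omega_2 = {0, ..., 9}
f = lambda w: w * w               # f maps Omega_1 into Omega_2
s1, s2 = {0, 1, 4}, {4, 9}

union_ok = inverse_image(f, domain, s1 | s2) == inverse_image(f, domain, s1) | inverse_image(f, domain, s2)
inter_ok = inverse_image(f, domain, s1 & s2) == inverse_image(f, domain, s1) & inverse_image(f, domain, s2)
compl_ok = inverse_image(f, domain, codomain - s1) == domain - inverse_image(f, domain, s1)

# Forward images can fail: {-2} and {2} are disjoint, yet both have image {4}.
image = lambda s: {f(w) for w in s}
forward_fails = image({-2} & {2}) != image({-2}) & image({2})
```

A finite check like this proves nothing in general, but it is a quick way to catch a misremembered identity.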

Special sets include the integers $\mathbf{Z} = \{0, 1, -1, 2, -2, 3, \ldots\}$, the positive
integers or natural numbers $\mathbf{N} = \{1, 2, 3, \ldots\}$, the rational numbers
$\mathbf{Q} = \{\frac{m}{n};\ m, n \in \mathbf{Z},\ n \ne 0\}$, and the real numbers $\mathbf{R}$.

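Given a subset $S \subseteq \Omega$, its indicator function $\mathbf{1}_S : \Omega \to \mathbf{R}$ is defined by

$$\mathbf{1}_S(\omega) = \begin{cases} 1, & \omega \in S \\ 0, & \omega \notin S. \end{cases}$$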


Finally, the Axiom of Choice states that given a collection $\{A_\alpha\}_{\alpha \in I}$ of
non-empty sets (i.e., $A_\alpha \ne \emptyset$ for all $\alpha$), their Cartesian product $\prod_\alpha A_\alpha$ is
also non-empty. In symbols,

$$\prod_\alpha A_\alpha \ne \emptyset \quad \text{whenever } A_\alpha \ne \emptyset \text{ for each } \alpha. \tag{A.1.1}$$

That is, it is possible to “choose” one element simultaneously from each of
the non-empty sets $A_\alpha$, a fact that seems straightforward, but does not
follow from the other axioms of set theory.


A.2. Countable sets. 


One important property of sets is their cardinality, or size. A set $\Omega$
is finite if for some $n \in \mathbf{N}$ and some function $f : \mathbf{N} \to \Omega$, we have
$f(\{1, 2, \ldots, n\}) \supseteq \Omega$. A set is countable if for some function $f : \mathbf{N} \to \Omega$, we
have $f(\mathbf{N}) = \Omega$. (Note that, by this definition, all finite sets are countable.
As a special case, the empty set $\emptyset$ is both finite and countable.) A set is
countably infinite if it is countable but not finite. It is uncountable if it is
not countable.


²More formally, it is a collection of ordered pairs $(\omega_1, \omega_2) \in \Omega_1 \times \Omega_2$, with $\omega_1 \in \Omega_1$
and $\omega_2 = f(\omega_1) \in \Omega_2$, such that no $\omega_1$ appears in more than one pair.
Intuitively, a set is countable if its elements can be arranged in a
sequence. It turns out that countability of sets is very important in
ability theory. The axioms of probability theory make certain guarantees 
about countable operations (e.g., the countable additivity of probability 
measures) which do not necessarily hold for uncountable operations. Thus, 
it is important to understand which sets are countable and which are not. 

The natural numbers $\mathbf{N}$ are obviously countable; indeed, we can take the
identity function (i.e., $f(n) = n$ for all $n \in \mathbf{N}$) in the definition of countable
above. Furthermore, it is clear that any subset of a countable set is again
countable; hence, any subset of $\mathbf{N}$ is also countable.

The integers $\mathbf{Z}$ are also countable: we can take, say, $f(1) = 0$, $f(2) = 1$,
$f(3) = -1$, $f(4) = 2$, $f(5) = -2$, $f(6) = 3$, etc. to ensure that $f(\mathbf{N}) = \mathbf{Z}$.
More surprising is 


Proposition A.2.1. Let $\Omega_1$ and $\Omega_2$ be countable sets. Then their
Cartesian product $\Omega_1 \times \Omega_2$ is also countable.


Proof. Let $f_1$ and $f_2$ be such that $f_1(\mathbf{N}) = \Omega_1$ and $f_2(\mathbf{N}) = \Omega_2$.
Then define a new function $f : \mathbf{N} \to \Omega_1 \times \Omega_2$ by $f(1) = (f_1(1), f_2(1))$,
$f(2) = (f_1(1), f_2(2))$, $f(3) = (f_1(2), f_2(1))$, $f(4) = (f_1(1), f_2(3))$,
$f(5) = (f_1(2), f_2(2))$, $f(6) = (f_1(3), f_2(1))$, $f(7) = (f_1(1), f_2(4))$, etc. (See
Figure A.2.2.) Then $f(\mathbf{N}) = \Omega_1 \times \Omega_2$, as required. ∎
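The index pattern used in the proof runs along the diagonals $m + n = 2, 3, 4, \ldots$ of $\mathbf{N} \times \mathbf{N}$. A minimal sketch of that enumeration, with the hypothetical generator name `diagonal_pairs`:

```python
import itertools

def diagonal_pairs():
    """Yield the index pairs (m, n) of N x N along the diagonals m + n = 2, 3, 4, ..."""
    total = 2
    while True:
        for m in range(1, total):
            yield (m, total - m)     # f(k) = (f_1(m), f_2(n)) uses this pair
        total += 1

first_seven = list(itertools.islice(diagonal_pairs(), 7))
```

The first seven pairs reproduce the indices of $f(1), \ldots, f(7)$ listed in the proof. Every pair $(m, n)$ is reached after finitely many steps, because the diagonal $m + n = k$ contains only $k - 1$ pairs.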


(1,4)  (2,4)  (3,4)  (4,4)
(1,3)  (2,3)  (3,3)  (4,3)
(1,2)  (2,2)  (3,2)  (4,2)
(1,1)  (2,1)  (3,1)  (4,1)

Figure A.2.2. Constructing f in Proposition A.2.1.


From this proposition, we see that e.g. $\mathbf{N} \times \mathbf{N}$ is countable, as is
$\mathbf{Z} \times \mathbf{Z}$. It follows immediately (and, perhaps, surprisingly) that the set of all
rational numbers is countable; indeed, $\mathbf{Q}$ is equivalent to a subset of $\mathbf{Z} \times \mathbf{Z}$
if we identify the rational number $m/n$ (in lowest terms) with the element
$(m, n) \in \mathbf{Z} \times \mathbf{Z}$. It also follows that, given a sequence $\Omega_1, \Omega_2, \ldots$ of countable
sets, their countable union $\Omega = \bigcup_i \Omega_i$ is also countable; indeed, if $f_i(\mathbf{N}) = \Omega_i$,
then we can identify $\Omega$ with a subset of $\mathbf{N} \times \mathbf{N}$ by the mapping
$(m, n) \mapsto f_m(n)$.

For another example of the use of Proposition A.2.1, recall that a real 
number is algebraic if it is a root of some non-constant polynomial with 
integer coefficients. 


Exercise A.2.3. Prove that the set of all algebraic numbers is countable. 


On the other hand, the set of all real numbers is not countable. Indeed,
even the unit interval $[0, 1]$ is not countable. To see this, suppose to the
contrary that it were countable, with $f : \mathbf{N} \to [0, 1]$ such that $f(\mathbf{N}) = [0, 1]$.
Imagine writing each element of $[0, 1]$ in its usual base-10 expansion, and
let $d_i$ be the $i$th digit of the number $f(i)$. Now define $c_i$ by: $c_i = 4$ if $d_i = 5$,
while $c_i = 5$ if $d_i \ne 5$. (In particular, $c_i \ne d_i$.) Then let $x = \sum_{i=1}^{\infty} c_i 10^{-i}$
(so that the base-10 expansion of $x$ is $0.c_1 c_2 c_3 \ldots$). Then $x$ differs from $f(i)$
in the $i$th digit of their base-10 expansions, for each $i$. Hence, $x \ne f(i)$ for
any $i$. This contradicts the assumption that $f(\mathbf{N}) = [0, 1]$.
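The digit rule in this argument can be tried on a finite list. The sketch below applies it to four made-up digit strings standing in for the first digits of $f(1), \ldots, f(4)$; both the data and the function name `diagonal_escape` are hypothetical.

```python
def diagonal_escape(expansions):
    """Return digits c_1 c_2 ... with c_i = 4 if the i-th digit of the i-th string is 5, else 5."""
    return "".join("4" if row[i] == "5" else "5" for i, row in enumerate(expansions))

listed = ["5000", "1234", "9959", "0005"]   # made-up leading digits of f(1), ..., f(4)
x_digits = diagonal_escape(listed)
differs = all(x_digits[i] != listed[i][i] for i in range(len(listed)))
```

However the list is chosen, the constructed string disagrees with the $i$th entry in position $i$. Using only the digits 4 and 5 also means the number built this way has a unique base-10 expansion, which sidesteps ambiguities such as $0.4999\ldots = 0.5000\ldots$.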


A.3. Epsilons and Limits. 


Real analysis and probability theory make constant use of arguments
involving arbitrarily small $\epsilon > 0$. A basic starting block is


Proposition A.3.1. Let a and b be two real numbers. Suppose that, for 
any € > 0, we havea <b+e. Then a < b. 


Proof. Suppose to the contrary that a > b. Let «e = az, Then e > 0, 
but it is not the case that a < b+ e. This is a contradiction. i 


Arbitrarily small $\epsilon > 0$ are used to define the important concept of
a limit. We say that a sequence of real numbers $x_1, x_2, \ldots$ converges
to the real number $x$ (or, has limit $x$) if, given any $\epsilon > 0$,
there is $N \in \mathbf{N}$ (which may depend on $\epsilon$) such that for
any $n \ge N$, we have $|x_n - x| < \epsilon$. We shall also write this as
$\lim_{n\to\infty} x_n = x$, and shall sometimes abbreviate this to
$\lim_n x_n = x$ or even $\lim x_n = x$. We also write this as
$\{x_n\} \to x$.
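
To make the roles of $\epsilon$ and $N$ concrete, here is a minimal sketch
for the example sequence $x_n = n/(n+1)$, which has limit 1. The sequence
and the names `x` and `first_index` are chosen here for illustration and do
not come from the text. Since $|x_n - 1| = 1/(n+1)$, any $N \ge 1/\epsilon$
works, and the loop spot-checks the defining inequality on a finite range
of indices (a numerical check, not a proof).

```python
import math

def x(n):
    """Example sequence x_n = n / (n + 1), whose limit is 1."""
    return n / (n + 1)

def first_index(eps):
    """Return N = ceil(1/eps); then |x_n - 1| = 1/(n + 1) < eps for all n >= N."""
    return math.ceil(1 / eps)

for eps in (0.5, 0.1, 0.003):
    N = first_index(eps)
    # Spot-check the defining inequality on a finite range of indices n >= N.
    ok = all(abs(x(n) - 1) < eps for n in range(N, N + 1000))
    print(eps, N, ok)
```

Note how $N$ grows as $\epsilon$ shrinks, which is the sense in which $N$
"may depend on $\epsilon$".
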


Exercise A.3.2. Use the definition of a limit to prove that
(a) $\lim_{n\to\infty} \frac{1}{n} = 0$.
(b) $\lim_{n\to\infty} \frac{1}{n^k} = 0$ for any $k > 0$.
(c) $\lim_{n\to\infty} 2^{1/n} = 1$.


A useful reformulation of limits is given by 


Proposition A.3.3. Let $\{x_n\}$ be a sequence of real numbers. Then
$\{x_n\} \to x$ if and only if for each $\epsilon > 0$, the set
$\{n \in \mathbf{N} ; |x_n - x| \ge \epsilon\}$ is finite.


Proof. If $\{x_n\} \to x$, then given $\epsilon > 0$, there is
$N \in \mathbf{N}$ such that $|x_n - x| < \epsilon$ for all $n \ge N$.
Hence, the set $\{n ; |x_n - x| \ge \epsilon\}$ contains at most $N - 1$
elements, and so is finite.

Conversely, given $\epsilon > 0$, if the set $\{n ; |x_n - x| \ge \epsilon\}$
is finite, then it has some largest element, say $K$ (if the set is empty,
take $K = 0$). Setting $N = K + 1$, we see that $|x_n - x| < \epsilon$
whenever $n \ge N$. Hence, $\{x_n\} \to x$. ∎


Exercise A.3.4. Use Proposition A.3.3 to provide an alternative proof 
for each of the three parts of Exercise A.3.2. 


A sequence which does not converge is said to diverge. There is one
special case. A sequence $\{x_n\}$ converges to infinity if for all
$M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that $x_n \ge M$
whenever $n \ge N$. (Similarly, $\{x_n\}$ converges to negative infinity if
for all $M \in \mathbf{R}$, there is $N \in \mathbf{N}$, such that
$x_n \le M$ whenever $n \ge N$.) This is a special kind of “convergence”
which is also sometimes referred to as divergence!

Limits have many useful properties. For example, 


Exercise A.3.5. Prove the following:

(a) If $\lim_n x_n = x$, and $a \in \mathbf{R}$, then $\lim_n a x_n = a x$.

(b) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n + y_n) = x + y$.

(c) If $\lim_n x_n = x$ and $\lim_n y_n = y$, then $\lim_n (x_n y_n) = x y$.

(d) If $\lim_n x_n = x$, and $x \ne 0$, then $\lim_n (1/x_n) = 1/x$.


Another useful property is given by 


Proposition A.3.6. Suppose $x_n \to x$, and $x_n \le a$ for all
$n \in \mathbf{N}$. Then $x \le a$.

Proof. Suppose to the contrary that $x > a$. Let $\epsilon = \frac{x-a}{2}$.
Then $\epsilon > 0$, but for all $n \in \mathbf{N}$ we have
$|x_n - x| \ge x - x_n \ge x - a > \epsilon$, contradicting the assertion
that $x_n \to x$. ∎

Given a sequence $\{x_n\}$, a subsequence $\{x_{n_k}\}$ is any sequence
formed from $\{x_n\}$ by omitting some of the elements. (For example, one
subsequence of the sequence $(1,3,5,7,9,\ldots)$ of all positive odd numbers
is the sequence $(3,9,27,81,\ldots)$ of all positive integer powers of 3.)
The Bolzano-Weierstrass theorem, a consequence of compactness, says that
every bounded sequence contains a convergent subsequence.

Limits are also used to define infinite series. Indeed, the sum
$\sum_{i=1}^\infty s_i$ is simply a shorthand way of writing
$\lim_{n\to\infty} \sum_{i=1}^n s_i$. If this limit exists and is finite,
then we say that the series $\sum_{i=1}^\infty s_i$ converges; otherwise we
say the series diverges. (In particular, for infinite series, converging to
infinity is usually referred to as diverging.) If the $s_i$ are
non-negative, then $\sum_{i=1}^\infty s_i$ either converges to a finite
value or diverges to infinity; we write this as
$\sum_{i=1}^\infty s_i < \infty$ and $\sum_{i=1}^\infty s_i = \infty$,
respectively.

For example, it is not hard to show (by comparing $\sum_{i=1}^\infty i^{-a}$
to the integral $\int_1^\infty t^{-a}\,dt$) that

$$\sum_{i=1}^\infty (1/i^a) < \infty \quad \text{if and only if} \quad a > 1. \tag{A.3.7}$$
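
The integral comparison behind (A.3.7) can be observed numerically. The
sketch below (with the hypothetical helper `partial_sum`) prints partial
sums for $a = 2$ and $a = 1$. For $a = 2$ they stay below
$1 + \int_1^\infty t^{-2}\,dt = 2$, while for $a = 1$ they exceed
$\int_1^{n+1} dt/t = \log(n+1)$, which is unbounded. This is a numerical
illustration under those two bounds, not a proof.

```python
import math

def partial_sum(a, n):
    """Return the n-th partial sum of the series sum_{i >= 1} i**(-a)."""
    return sum(i ** (-a) for i in range(1, n + 1))

for n in (10, 1000, 100000):
    # a = 2: bounded above by 2.  a = 1: bounded below by log(n + 1).
    print(n, partial_sum(2, n), partial_sum(1, n), math.log(n + 1))
```
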


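Exercise A.3.8. Prove that $\sum_{i=2}^\infty \frac{1}{i \log(i)} = \infty$.
[Hint: Show $\sum_{i=2}^\infty \frac{1}{i \log(i)} \ge \int_2^\infty \frac{dx}{x \log x}$.]


Exercise A.3.9. Prove that
$\sum_{i=3}^\infty \frac{1}{i \log(i) \log\log(i)} = \infty$, but
$\sum_{i=3}^\infty \frac{1}{i \log(i) [\log\log(i)]^2} < \infty$.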


A.4. Infimums and supremums. 


Given a set $\{x_\alpha\}_{\alpha \in I}$ of real numbers, a lower bound for
them is a real number $\ell$ such that $x_\alpha \ge \ell$ for all
$\alpha \in I$. The set $\{x_\alpha\}_{\alpha \in I}$ is bounded below if it
has at least one lower bound. A lower bound is called the greatest lower
bound if it is at least as large as any other lower bound.

A very important property of the real numbers is: Any non-empty set 
of real numbers which is bounded below has a unique greatest lower bound. 
The uniqueness part of this is rather obvious; however, the existence part 
is very subtle. For example, this property would not hold if we restricted 
ourselves to rational numbers. Indeed, the set
$\{q \in \mathbf{Q} ;\ q > 0,\ q^2 > 2\}$ does not have a greatest lower
bound if we allow rational numbers only; however, if we allow all real
numbers, then the greatest lower bound is $\sqrt{2}$.
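
One way to get a feel for this example is to produce ever smaller elements
of the set using exact rational arithmetic. The sketch below uses the map
$q \mapsto (q + 2/q)/2$, a technique brought in here for illustration and
not taken from the text. If $q > 0$ is rational with $q^2 > 2$, then the new
value is a smaller rational whose square still exceeds 2, since its square
minus 2 equals $(q - 2/q)^2/4 > 0$. This shows only that the set has no
smallest element; it is not by itself a proof that no rational greatest
lower bound exists.

```python
from fractions import Fraction

def next_element(q):
    """Map a rational q > 0 with q*q > 2 to a smaller rational with the same property."""
    return (q + 2 / q) / 2

q = Fraction(2)
for _ in range(4):
    q = next_element(q)
    print(q, float(q * q - 2))   # q stays rational; q*q - 2 stays positive and shrinks
```
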

The greatest lower bound is also called the infimum, and is written as
$\inf\{x_\alpha ; \alpha \in I\}$ or $\inf_{\alpha \in I} x_\alpha$. We
similarly define the supremum of a set of real numbers to be their least
upper bound. By convention, for the empty set $\emptyset$, we define
$\inf \emptyset = \infty$ and $\sup \emptyset = -\infty$. Also, if a set $S$
is not bounded below, then $\inf S = -\infty$; similarly, if it is not
bounded above, then $\sup S = \infty$. One obvious but useful fact is

$$\inf S \le x, \quad \text{for any } x \in S. \tag{A.4.1}$$


Of course, if a set of real numbers has a minimal element (for example,
if the set is finite and non-empty), then this minimal element is the
infimum. However, infimums exist even for sets without minimal elements.
For example, if $S$ is the set of all positive real numbers, then
$\inf S = 0$, even though $0 \notin S$.

A simple but useful property of infimums is the following. 


Proposition A.4.2. Let $S$ be a non-empty set of real numbers which
is bounded below. Let $a = \inf S$. Then for any $\epsilon > 0$, there is
$s \in S$ with $a \le s < a + \epsilon$.


Proof. Suppose, to the contrary, that there is no such $s$. Clearly there is
no $s \in S$ with $s < a$ (otherwise $a$ would not be a lower bound for $S$);
hence, it must be that all $s \in S$ satisfy $s \ge a + \epsilon$. But in
this case, $a + \epsilon$ is a lower bound for $S$ which is larger than $a$.
This contradicts the assertion that $a$ was the greatest lower bound for
$S$. ∎


For example, if $S$ is the interval $(5,20)$, then $\inf S = 5$, and
$5 \notin S$, but for any $\epsilon > 0$, there is $x \in S$ with
$x < 5 + \epsilon$.


Exercise A.4.3. (a) Compute $\inf\{x \in \mathbf{R} : x > 10\}$.
(b) Compute $\inf\{q \in \mathbf{Q} : q > 10\}$.
(c) Compute $\inf\{q \in \mathbf{Q} : q \ge 10\}$.


Exercise A.4.4. (a) Let $R, S \subseteq \mathbf{R}$ each be non-empty and
bounded below. Prove that $\inf(R \cup S) = \min(\inf R, \inf S)$.

(b) Prove that this formula continues to hold if $R = \emptyset$ and/or
$S = \emptyset$.

(c) State and prove a similar formula for $\sup(R \cup S)$.


Exercise A.4.5. Suppose $\{a_n\} \to a$. Prove that
$\inf_n a_n \le a \le \sup_n a_n$. [Hint: Use proof by contradiction.]


Two special kinds of limits involve infimums and supremums, namely the
limit inferior and the limit superior, defined by
$\liminf_{n\to\infty} x_n = \lim_{n\to\infty} \inf_{k \ge n} x_k$ and
$\limsup_{n\to\infty} x_n = \lim_{n\to\infty} \sup_{k \ge n} x_k$,
respectively. Indeed, such limits always exist (though they may be
infinite). Furthermore, $\lim_n x_n$ exists if and only if
$\liminf_n x_n = \limsup_n x_n$.


Exercise A.4.6. Suppose $\{a_n\} \to a$ and $\{b_n\} \to b$, with $a < b$.
Let $c_n = a_n$ for $n$ odd, and $c_n = b_n$ for $n$ even. Compute
$\liminf c_n$ and $\limsup c_n$.
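
The definitions of the limit inferior and limit superior can be explored
numerically. The sketch below uses the example sequence
$x_n = (-1)^n (1 + 1/n)$, chosen here for illustration, and replaces the
infinite tail $k \ge n$ by the finite range $n \le k \le$ `horizon`, which
is an assumption of the sketch. For this particular sequence the tail
infimum and supremum are attained at the first odd and first even index
in the tail, so the truncation does not change their values.

```python
def x(n):
    """Example sequence x_n = (-1)**n * (1 + 1/n), which has no limit."""
    return (-1) ** n * (1 + 1 / n)

def tail_inf_sup(n, horizon=10_000):
    """Return (inf, sup) of x_k over n <= k <= horizon, a finite stand-in for k >= n."""
    tail = [x(k) for k in range(n, horizon + 1)]
    return min(tail), max(tail)

for n in (1, 10, 100, 1000):
    print(n, tail_inf_sup(n))
```

As $n$ grows, the tail infimum increases towards $-1$ and the tail supremum
decreases towards $1$, so the limit inferior and limit superior differ and
the sequence has no limit.
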


We shall sometimes use the “order” notations, $O(\cdot)$ and $o(\cdot)$. A
quantity $g(x)$ is said to be $O(h(x))$ as $x \to \ell$ if
$\limsup_{x \to \ell} |g(x)/h(x)| < \infty$. (Here $\ell$ can be $\infty$,
or $-\infty$, or $0$, or any other quantity. Also, $h(x)$ can be any
function, such as $x$, or $x^2$, or $1/x$, or $1$.) Similarly, $g(x)$ is
said to be $o(h(x))$ as $x \to \ell$ if
$\limsup_{x \to \ell} |g(x)/h(x)| = 0$.


Exercise A.4.7. Prove that $\{a_n\} \to 0$ if and only if $a_n = o(1)$ as
$n \to \infty$.


Finally, we note the following. We see by Exercise A.3.5(b) and induction
that if $\{x_{nk}\}$ are real numbers for $n \in \mathbf{N}$ and
$1 \le k \le K < \infty$, with $\lim_{n\to\infty} x_{nk} = 0$ for each $k$,
then $\lim_{n\to\infty} \sum_{k=1}^K x_{nk} = 0$ as well. However, if there
are an infinite number of different $k$ then this is not true in general
(for example, suppose $x_{nn} = 1$ but $x_{nk} = 0$ for $k \ne n$). Still,
the following proposition gives a useful condition under which this
conclusion holds.


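Proposition A.4.8. (The M-test.) Let $\{x_{nk}\}_{n,k \in \mathbf{N}}$ be a
collection of real numbers. Suppose that $\lim_{n\to\infty} x_{nk} = a_k$
for each fixed $k \in \mathbf{N}$. Suppose further that
$\sum_{k=1}^\infty \sup_n |x_{nk}| < \infty$. Then
$\lim_{n\to\infty} \sum_{k=1}^\infty x_{nk} = \sum_{k=1}^\infty a_k$.


Proof. The hypotheses imply that $\sum_{k=1}^\infty |a_k| < \infty$. Hence,
by replacing $x_{nk}$ by $x_{nk} - a_k$, it suffices to assume that
$a_k = 0$ for all $k$.

Fix $\epsilon > 0$. Since $\sum_{k=1}^\infty \sup_n |x_{nk}| < \infty$, we
can find $K \in \mathbf{N}$ such that
$\sum_{k=K+1}^\infty \sup_n |x_{nk}| < \epsilon/2$. Since
$\lim_{n\to\infty} x_{nk} = 0$, we can find (for $k = 1, 2, \ldots, K$)
numbers $N_k$ with $|x_{nk}| < \epsilon/(2K)$ for all $n \ge N_k$. Let
$N = \max(N_1, \ldots, N_K)$. Then for $n \ge N$, we have
$\left|\sum_{k=1}^\infty x_{nk}\right| < K \frac{\epsilon}{2K} + \frac{\epsilon}{2} = \epsilon$.

Since $\epsilon > 0$ was arbitrary,
$\limsup_{n\to\infty} \sum_{k=1}^\infty x_{nk} \le \sum_{k=1}^\infty a_k$.
Similarly,
$\liminf_{n\to\infty} \sum_{k=1}^\infty x_{nk} \ge \sum_{k=1}^\infty a_k$.
The result follows. ∎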
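
The role of the summability hypothesis in the M-test can be seen by
comparing two arrays. The first is the counterexample $x_{nn} = 1$,
$x_{nk} = 0$ for $k \ne n$, where every column tends to 0 but each row sums
to 1. The second, $x_{nk} = 2^{-k} n/(n+1)$, is an example chosen here for
illustration; it satisfies $\sup_n |x_{nk}| = 2^{-k}$, which is summable,
and $a_k = 2^{-k}$. The function names and the truncation level `K` are
assumptions of this sketch, and sums over $k$ are truncated at `K`.

```python
K = 60  # truncation level for the sums over k (an assumption of this sketch)

def row_sum(x, n, k_max):
    """Return sum_{k=1}^{k_max} x(n, k)."""
    return sum(x(n, k) for k in range(1, k_max + 1))

def moving_spike(n, k):
    """x_{nk} = 1 if k == n, else 0: each column tends to 0, but sup_n |x_{nk}| = 1."""
    return 1.0 if k == n else 0.0

def dominated(n, k):
    """x_{nk} = 2**(-k) * n / (n + 1): sup_n |x_{nk}| = 2**(-k), which is summable."""
    return 2.0 ** (-k) * n / (n + 1)

for n in (1, 10, 50):
    print(n, row_sum(moving_spike, n, K), row_sum(dominated, n, K))
```

For the first array the row sums stay at 1 although $\sum_k a_k = 0$; for
the second they approach $\sum_k 2^{-k} = 1$, as the M-test predicts.
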


If $\lim_{n\to\infty} x_{nk} = a_k$ for each fixed $k \in \mathbf{N}$, with
$x_{nk} \ge 0$, but if we do not know that
$\sum_{k=1}^\infty \sup_n x_{nk} < \infty$, then we still have

$$\lim_{n\to\infty} \sum_{k=1}^\infty x_{nk} \ge \sum_{k=1}^\infty a_k, \tag{A.4.9}$$

assuming this limit exists. Indeed, if not then we could find some finite
$K \in \mathbf{N}$ with
$\lim_{n\to\infty} \sum_{k=1}^\infty x_{nk} < \sum_{k=1}^K a_k$,
contradicting the fact that
$\lim_{n\to\infty} \sum_{k=1}^\infty x_{nk} \ge \lim_{n\to\infty} \sum_{k=1}^K x_{nk} = \sum_{k=1}^K a_k$.


A.5. Equivalence relations.


In one place in the notes (the proof of Proposition 1.2.6), the idea of an 
equivalence relation is used. Thus, we briefly review equivalence relations 
here. 

A relation on a set $S$ is a boolean function on $S \times S$; that is,
given $x, y \in S$, either $x$ is related to $y$ (written $x \sim y$), or
$x$ is not related to $y$ (written $x \not\sim y$). A relation is an
equivalence relation if (a) it is reflexive, i.e. $x \sim x$ for all
$x \in S$; (b) it is symmetric, i.e. $x \sim y$ whenever $y \sim x$; and
(c) it is transitive, i.e. $x \sim z$ whenever $x \sim y$ and $y \sim z$.

Given an equivalence relation $\sim$ and an element $x \in S$, the
equivalence class of $x$ is the set of all $y \in S$ such that $y \sim x$.
It is straightforward to verify that, if $\sim$ is an equivalence relation,
then any pair of equivalence classes is either identical or disjoint. It
follows that, given an equivalence relation, the collection of equivalence
classes forms a partition of the set $S$. This fact is used in the proof of
Proposition 1.2.6 herein.
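
On a finite set, the three defining properties and the resulting partition
can be checked by brute force. The sketch below is illustrative only: the
function names, the set $S = \{-4, \ldots, 4\}$ and the relation "x and y
have the same parity" are all choices made for this example.

```python
def is_equivalence(S, related):
    """Check reflexivity, symmetry and transitivity of `related` on the finite set S."""
    reflexive = all(related(x, x) for x in S)
    symmetric = all(related(y, x) for x in S for y in S if related(x, y))
    transitive = all(related(x, z)
                     for x in S for y in S for z in S
                     if related(x, y) and related(y, z))
    return reflexive and symmetric and transitive

def equivalence_classes(S, related):
    """Return the distinct classes {y in S : y ~ x} as a set of frozensets."""
    return {frozenset(y for y in S if related(y, x)) for x in S}

S = range(-4, 5)
same_parity = lambda x, y: (x - y) % 2 == 0
classes = equivalence_classes(S, same_parity)
print(is_equivalence(S, same_parity), sorted(sorted(c) for c in classes))
# True [[-4, -2, 0, 2, 4], [-3, -1, 1, 3]]
```

The two classes are disjoint and together cover $S$, in line with the
partition property stated above.
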


Exercise A.5.1. For each of the following relations on $S = \mathbf{Z}$,
determine whether or not it is an equivalence relation; and if it is, then
find the equivalence class of the element 1:

(a) $x \sim y$ if and only if $|y - x|$ is an integer multiple of 3.

(b) $x \sim y$ if and only if $|y - x| \le 5$.

(c) $x \sim y$ if and only if $|x|, |y| \le 5$.

(d) $x \sim y$ if and only if either $x = y$, or $|x|, |y| \le 5$ (or both).

(e) $x \sim y$ if and only if $|x| = |y|$.

(f) $x \sim y$ if and only if $|x| \ge |y|$.


A FIRST LOOK AT RIGOROUS PROBABILITY THEORY


This textbook is an introduction to probability theory using
measure theory. It is designed for graduate students in a variety
of fields (mathematics, statistics, economics, management, finance,
computer science, and engineering) who require a working knowledge
of probability theory that is mathematically precise, but without excessive
technicalities. The text provides complete proofs of all the essential
introductory results. Nevertheless, the treatment is focused and
accessible, with the measure theory and mathematical details presented
in terms of intuitive probabilistic concepts, rather than as separate,
imposing subjects. In this new edition, many exercises and small
additional topics have been added and existing ones expanded. The
text strikes an appropriate balance, rigorously developing probability
theory while avoiding unnecessary detail.